Chapter 6. Exceptions and File Handling

In this chapter, you will dive into exceptions, file objects, for loops, and the os and sys modules. If you’ve used exceptions in another programming language, you can skim the first section to get a sense of Python’s syntax. Be sure to tune in again for file handling.

6.1. Handling Exceptions

Like many other programming languages, Python has exception handling via try...except blocks.

Note: Python vs. Java exception handling. Python uses try...except to handle exceptions and raise to generate them. Java and C++ use try...catch to handle exceptions, and throw to generate them.

Exceptions are everywhere in Python. Virtually every module in the standard Python library uses them, and Python itself will raise them in a lot of different circumstances. You’ve already seen them repeatedly throughout this book.

  • Accessing a non-existent dictionary key will raise a KeyError exception.
  • Searching a list for a non-existent value will raise a ValueError exception.
  • Calling a non-existent method will raise an AttributeError exception.
  • Referencing a non-existent variable will raise a NameError exception.
  • Mixing datatypes without coercion will raise a TypeError exception.
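
The five cases above can be reproduced in a few lines. This is only a sketch: the helper exception_name, the sample dictionary and list, and the undefined name no_such_variable are all made up for illustration, and the "except Exception as e" spelling assumes Python 2.6 or later.

```python
def exception_name(action):
    """Call action() and return the name of the exception it raises, or None."""
    try:
        action()
    except Exception as e:
        return type(e).__name__
    return None

d = {"a": 1}
li = [1, 2, 3]

names = [
    exception_name(lambda: d["missing"]),         # non-existent dictionary key
    exception_name(lambda: li.index(99)),         # non-existent list value
    exception_name(lambda: li.no_such_method()),  # non-existent method
    exception_name(lambda: no_such_variable),     # non-existent variable
    exception_name(lambda: "abc" + 1),            # mixing datatypes without coercion
]
print(names)
```

Each lambda delays the faulty operation until exception_name calls it inside the try block, so the exception is caught instead of stopping the script.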

In each of these cases, you were simply playing around in the Python IDE: an error occurred, the exception was printed (depending on your IDE, perhaps in an intentionally jarring shade of red), and that was that. This is called an unhandled exception. When the exception was raised, there was no code to explicitly notice it and deal with it, so it bubbled its way back to the default behavior built in to Python, which is to spit out some debugging information and give up. In the IDE, that’s no big deal, but if that happened while your actual Python program was running, the entire program would come to a screeching halt.

An exception doesn’t need to result in a complete program crash, though. Exceptions, when raised, can be handled. Sometimes an exception is really because you have a bug in your code (like accessing a variable that doesn’t exist), but many times, an exception is something you can anticipate. If you’re opening a file, it might not exist. If you’re connecting to a database, it might be unavailable, or you might not have the correct security credentials to access it. If you know a line of code may raise an exception, you should handle the exception using a try...except block.

Example 6.1. Opening a Non-Existent File

>>> fsock = open("/notthere", "r")      (1)
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
IOError: [Errno 2] No such file or directory: '/notthere'
>>> try:
...     fsock = open("/notthere")       (2)
... except IOError:                     (3)
...     print "The file does not exist, exiting gracefully"
... print "This line will always print" (4)
The file does not exist, exiting gracefully
This line will always print
  1. Using the built-in open function, you can try to open a file for reading (more on open in the next section). But the file doesn’t exist, so this raises the IOError exception. Since you haven’t provided any explicit check for an IOError exception, Python just prints out some debugging information about what happened and then gives up.
  2. You’re trying to open the same non-existent file, but this time you’re doing it within a try...except block.
  3. When the open function raises an IOError exception, you’re ready for it. The except IOError: line catches the exception and executes your own block of code, which in this case just prints a more pleasant error message.
  4. Once an exception has been handled, processing continues normally on the first line after the try...except block. Note that this line will always print, whether or not an exception occurs. If you really did have a file called notthere in your root directory, the call to open would succeed, the except clause would be ignored, and this line would still be executed.
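
The same flow can be written as a script rather than typed at the prompt. The sketch below is an illustration only: the function try_open and the temporary directory are assumptions introduced so the example can run anywhere without touching your root directory.

```python
import os
import tempfile

def try_open(path):
    """Return the messages a script would print while trying to open path."""
    messages = []
    fsock = None
    try:
        fsock = open(path)
    except IOError:
        messages.append("The file does not exist, exiting gracefully")
    # Execution continues here whether or not the exception occurred.
    messages.append("This line will always print")
    if fsock is not None:
        fsock.close()
    return messages

# A path inside a fresh temporary directory is guaranteed not to exist yet.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "notthere")
missing_result = try_open(path)

# Create the file and try again: this time the except clause is skipped.
open(path, "w").close()
present_result = try_open(path)

os.remove(path)
os.rmdir(tmpdir)
print(missing_result)
print(present_result)
```

The first call reports both messages; the second reports only the final one, because open succeeded and the except clause was never entered.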

Exceptions may seem unfriendly (after all, if you don’t catch the exception, your entire program will crash), but consider the alternative. Would you rather get back an unusable file object to a non-existent file? You’d need to check its validity somehow anyway, and if you forgot, your program would give you strange errors somewhere down the line that you would need to trace back to the source. I’m sure you’ve experienced this, and you know it’s not fun. With exceptions, errors occur immediately, and you can handle them in a standard way at the source of the problem.

6.1.1. Using Exceptions For Other Purposes

There are a lot of other uses for exceptions besides handling actual error conditions. A common use in the standard Python library is to try to import a module, and then check whether it worked. Importing a module that does not exist will raise an ImportError exception. You can use this to define multiple levels of functionality based on which modules are available at run-time, or to support multiple platforms (where platform-specific code is separated into different modules).

You can also define your own exceptions by creating a class that inherits from the built-in Exception class, and then raise your exceptions with the raise command. See the further reading section if you’re interested in doing this.
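
As a rough sketch of what defining and raising your own exception might look like, consider the following. The names TagError and parse_year are hypothetical and do not come from fileinfo.py or the standard library.

```python
class TagError(Exception):
    """Hypothetical exception raised when tag data cannot be parsed."""
    pass

def parse_year(text):
    """Convert a four-digit year string to an integer, or raise TagError."""
    if len(text) != 4 or not text.isdigit():
        raise TagError("not a four-digit year: " + repr(text))
    return int(text)

year = parse_year("2000")

outcome = "no error"
try:
    parse_year("20xx")
except TagError:
    outcome = "handled TagError"
print(outcome)
```

Because TagError inherits from Exception, it is caught by except TagError: exactly the way IOError was caught earlier.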

The next example demonstrates how to use an exception to support platform-specific functionality. This code comes from the getpass module, a wrapper module for getting a password from the user. Getting a password is accomplished differently on UNIX, Windows, and Mac OS platforms, but this code encapsulates all of those differences.

Example 6.2. Supporting Platform-Specific Functionality

# Bind the name getpass to the appropriate function
try:
    import termios, TERMIOS                     (1)
except ImportError:
    try:
        import msvcrt                           (2)
    except ImportError:
        try:
            from EasyDialogs import AskPassword (3)
        except ImportError:
            getpass = default_getpass           (4)
        else:                                   (5)
            getpass = AskPassword
    else:
        getpass = win_getpass
else:
    getpass = unix_getpass
  1. termios is a UNIX-specific module that provides low-level control over the input terminal. If this module is not available (because it’s not on your system, or your system doesn’t support it), the import fails and Python raises an ImportError, which you catch.
  2. OK, you didn’t have termios, so let’s try msvcrt, which is a Windows-specific module that provides an API to many useful functions in the Microsoft Visual C++ runtime services. If this import fails, Python will raise an ImportError, which you catch.
  3. If the first two didn’t work, you try to import a function from EasyDialogs, which is a Mac OS-specific module that provides functions to pop up dialog boxes of various types. Once again, if this import fails, Python will raise an ImportError, which you catch.
  4. None of these platform-specific modules is available (which is possible, since Python has been ported to a lot of different platforms), so you need to fall back on a default password input function (which is defined elsewhere in the getpass module). Notice what you’re doing here: assigning the function default_getpass to the variable getpass. If you read the official getpass documentation, it tells you that the getpass module defines a getpass function. It does this by binding getpass to the correct function for your platform. Then when you call the getpass function, you’re really calling a platform-specific function that this code has set up for you. You don’t need to know or care which platform your code is running on – just call getpass, and it will always do the right thing.
  5. A try...except block can have an else clause, like an if statement. If no exception is raised during the try block, the else clause is executed afterwards. In this case, that means that the from EasyDialogs import AskPassword import worked, so you should bind getpass to the AskPassword function. Each of the other try...except blocks has similar else clauses to bind getpass to the appropriate function when you find an import that works.
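
The same try...except ImportError...else shape works on a much smaller scale. The sketch below is not taken from getpass; it uses the well-known cPickle/pickle pair as a stand-in, and the name serializer is an assumption made for this illustration.

```python
try:
    import cPickle          # C implementation that exists only in Python 2
except ImportError:
    import pickle           # the standard pickle module
    serializer = pickle
else:
    serializer = cPickle    # the import worked, so bind the faster module

# Callers use the name serializer without caring which module is behind it.
data = serializer.loads(serializer.dumps(["kairo", 2000]))
print(data)
```

As with getpass, the rest of the program only ever sees one name, and the import-time code decides what that name refers to.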

Further Reading on Exception Handling

6.2. Working with File Objects

Python has a built-in function, open, for opening a file on disk. open returns a file object, which has methods and attributes for getting information about and manipulating the opened file.

Example 6.3. Opening a File

>>> f = open("/music/_singles/kairo.mp3", "rb") (1)
>>> f                                           (2)
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.mode                                      (3)
'rb'
>>> f.name                                      (4)
'/music/_singles/kairo.mp3'
  1. The open function can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename, is required; the other two are optional. If not specified, the file is opened for reading in text mode. Here you are opening the file for reading in binary mode. (print open.__doc__ displays a great explanation of all the possible modes.)
  2. The open function returns an object (by now, this should not surprise you). A file object has several useful attributes.
  3. The mode attribute of a file object tells you in which mode the file was opened.
  4. The name attribute of a file object tells you the name of the file that the file object has open.

6.2.1. Reading Files

After you open a file, the first thing you’ll want to do is read from it, as shown in the next example.

Example 6.4. Reading a File

>>> f
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.tell()              (1)
0
>>> f.seek(-128, 2)       (2)
>>> f.tell()              (3)
7542909
>>> tagData = f.read(128) (4)
>>> tagData
'TAGKAIRO****THE BEST GOA         ***DJ MARY-JANE***
Rave Mix                      2000http://mp3.com/DJMARYJANE     \037'
>>> f.tell()              (5)
7543037
  1. A file object maintains state about the file it has open. The tell method of a file object tells you your current position in the open file. Since you haven’t done anything with this file yet, the current position is 0, which is the beginning of the file.
  2. The seek method of a file object moves to another position in the open file. The second parameter specifies what the first one means; 0 means move to an absolute position (counting from the start of the file), 1 means move to a relative position (counting from the current position), and 2 means move to a position relative to the end of the file. Since the MP3 tags you’re looking for are stored at the end of the file, you use 2 and tell the file object to move to a position 128 bytes from the end of the file.
  3. The tell method confirms that the current file position has moved.
  4. The read method reads a specified number of bytes from the open file and returns a string with the data that was read. The optional parameter specifies the maximum number of bytes to read. If no parameter is specified, read will read until the end of the file. (You could have simply said read() here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.) The read data is assigned to the tagData variable, and the current position is updated based on how many bytes were read.
  5. The tell method confirms that the current position has moved. If you do the math, you’ll see that after reading 128 bytes, the position has been incremented by 128.
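
If you don’t have an MP3 file handy, you can watch the file position move with a throwaway file. This sketch assumes nothing about real MP3 data: it builds a 1128-byte temporary file whose last 128 bytes start with "TAG", then repeats the tell, seek, and read calls described above.

```python
import os
import tempfile

# Build a throwaway 1128-byte file: 1000 bytes of filler plus a 128-byte "tag".
fd, path = tempfile.mkstemp()
os.close(fd)
f = open(path, "wb")
f.write(b"x" * 1000 + b"TAG" + b"y" * 125)
f.close()

f = open(path, "rb")
start = f.tell()          # position before doing anything
f.seek(-128, 2)           # 2 means "relative to the end of the file"
before_read = f.tell()    # file size minus 128
tag = f.read(128)
after_read = f.tell()     # advanced by the number of bytes read
f.close()
os.remove(path)

print("%d %d %d" % (start, before_read, after_read))
```

The three positions are 0, 1000, and 1128: the start of the file, 128 bytes before the end, and the end itself.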

6.2.2. Closing Files

Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It’s important to close files as soon as you’re finished with them.

Example 6.5. Closing a File

>>> f
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.closed       (1)
False
>>> f.close()      (2)
>>> f
<closed file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
>>> f.closed       (3)
True
>>> f.seek(0)      (4)
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
ValueError: I/O operation on closed file
>>> f.tell()
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
ValueError: I/O operation on closed file
>>> f.read()
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
ValueError: I/O operation on closed file
>>> f.close()      (5)
  1. The closed attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (closed is False).
  2. To close a file, call the close method of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any) that the system hadn’t gotten around to actually writing yet, and releases the system resources.
  3. The closed attribute confirms that the file is closed.
  4. Just because a file is closed doesn’t mean that the file object ceases to exist. The variable f will continue to exist until it goes out of scope or gets manually deleted. However, none of the methods that manipulate an open file will work once the file has been closed; they all raise an exception.
  5. Calling close on a file object whose file is already closed does not raise an exception; it fails silently.
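
In a script, the same behavior can be checked without reading tracebacks. This is a minimal sketch using a temporary file; the variable names are made up for the example.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "rb")
was_closed = f.closed     # the file is still open at this point
f.close()
now_closed = f.closed     # the file object still exists, but it is closed

read_failed = False
try:
    f.read()              # any I/O on a closed file raises ValueError
except ValueError:
    read_failed = True

f.close()                 # closing again raises no exception
os.remove(path)
print("%s %s %s" % (was_closed, now_closed, read_failed))
```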

6.2.3. Handling I/O Errors

Now you’ve seen enough to understand the file handling code in the fileinfo.py sample code from the previous chapter. This example shows how to safely open and read from a file and gracefully handle errors.

Example 6.6. File Objects in MP3FileInfo

try:                                (1)
    fsock = open(filename, "rb", 0) (2)
    try:
        fsock.seek(-128, 2)         (3)
        tagdata = fsock.read(128)   (4)
    finally:                        (5)
        fsock.close()
    .
    .
    .
except IOError:                     (6)
    pass
  1. Because opening and reading files is risky and may raise an exception, all of this code is wrapped in a try...except block. (Hey, isn’t standardized indentation great? This is where you start to appreciate it.)
  2. The open function may raise an IOError. (Maybe the file doesn’t exist.)
  3. The seek method may raise an IOError. (Maybe the file is smaller than 128 bytes.)
  4. The read method may raise an IOError. (Maybe the disk has a bad sector, or it’s on a network drive and the network just went down.)
  5. This is new: a try...finally block. Once the file has been opened successfully by the open function, you want to make absolutely sure that you close it, even if an exception is raised by the seek or read methods. That’s what a try...finally block is for: code in the finally block will always be executed, even if something in the try block raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before.
  6. At last, you handle your IOError exception. This could be the IOError exception raised by the call to open, seek, or read. Here, you really don’t care, because all you’re going to do is ignore it silently and continue. (Remember, pass is a Python statement that does nothing.) That’s perfectly legal; “handling” an exception can mean explicitly doing nothing. It still counts as handled, and processing will continue normally on the next line of code after the try...except block.
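
To see the nested try...finally and try...except blocks at work outside of MP3FileInfo, you could wrap them in a function. The function read_tag and the temporary file below are assumptions for illustration; the real code stores its results in the object rather than returning them.

```python
import os
import tempfile

def read_tag(filename):
    """Return the last 128 bytes of filename, or None if any I/O step fails."""
    try:
        fsock = open(filename, "rb", 0)
        try:
            fsock.seek(-128, 2)
            return fsock.read(128)
        finally:
            fsock.close()   # runs on the way out, whether by return or exception
    except IOError:
        return None

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.mp3")
missing_result = read_tag(path)     # the file does not exist yet

f = open(path, "wb")
f.write(b"x" * 200)
f.close()
found_result = read_tag(path)       # now the last 128 bytes are available

os.remove(path)
os.rmdir(tmpdir)
```

Note that if open itself fails, the inner try block is never entered, so there is no file to close and the finally clause does not run.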

6.2.4. Writing to Files

As you would expect, you can also write to files in much the same way that you read from them. There are two basic file modes:

  • “Append” mode will add data to the end of the file.
  • “Write” mode will overwrite the file.

Either mode will create the file automatically if it doesn’t already exist, so there’s never a need for any sort of fiddly “if the log file doesn’t exist yet, create a new empty file just so you can open it for the first time” logic. Just open it and start writing.

Example 6.7. Writing to Files

>>> logfile = open('test.log', 'w') (1)
>>> logfile.write('test succeeded') (2)
>>> logfile.close()
>>> print file('test.log').read()   (3)
test succeeded
>>> logfile = open('test.log', 'a') (4)
>>> logfile.write('line 2')
>>> logfile.close()
>>> print file('test.log').read()   (5)
test succeededline 2
  1. You start boldly by creating the new file test.log (or overwriting the existing file) and opening it for writing. (The second parameter “w” means open the file for writing.) Yes, that’s all as dangerous as it sounds. I hope you didn’t care about the previous contents of that file, because it’s gone now.
  2. You can add data to the newly opened file with the write method of the file object returned by open.
  3. file is a synonym for open. This one-liner opens the file, reads its contents, and prints them.
  4. You happen to know that test.log exists (since you just finished writing to it), so you can open it and append to it. (The “a” parameter means open the file for appending.) Actually you could do this even if the file didn’t exist, because opening the file for appending will create the file if necessary. But appending will never harm the existing contents of the file.
  5. As you can see, both the original line you wrote and the second line you appended are now in test.log. Also note that line breaks are not included. Since you didn’t write them explicitly to the file either time, the file doesn’t include them. You can write a line break with the “\n” character. Since you didn’t do this, everything you wrote to the file ended up smooshed together on the same line.
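
If you want each entry on its own line, write the "\n" yourself. The following sketch repeats the write-then-append sequence with explicit line endings; it uses a temporary directory (an assumption made so the example doesn’t clobber a real test.log).

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "test.log")

logfile = open(path, "w")           # creates the file, or overwrites it
logfile.write("test succeeded\n")
logfile.close()

logfile = open(path, "a")           # appends without harming existing contents
logfile.write("line 2\n")
logfile.close()

f = open(path)
contents = f.read()
f.close()
lines = contents.splitlines()

os.remove(path)
os.rmdir(tmpdir)
print(lines)
```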

Further Reading on File Handling

6.3. Iterating with for Loops

Like most other languages, Python has for loops. The only reason you haven’t seen them until now is that Python is good at so many other things that you don’t need them as often.

Most other languages don’t have a powerful list datatype like Python, so you end up doing a lot of manual work, specifying a start, end, and step to define a range of integers or characters or other iterable entities. But in Python, a for loop simply iterates over a list, the same way list comprehensions work.

Example 6.8. Introducing the for Loop

>>> li = ['a', 'b', 'e']
>>> for s in li:         (1)
...     print s          (2)
a
b
e
>>> print "\n".join(li)  (3)
a
b
e
  1. The syntax for a for loop is similar to list comprehensions. li is a list, and s will take the value of each element in turn, starting from the first element.
  2. Like an if statement or any other indented block, a for loop can have any number of lines of code in it.
  3. This is the reason you haven’t seen the for loop yet: you haven’t needed it yet. It’s amazing how often you use for loops in other languages when all you really want is a join or a list comprehension.

Doing a “normal” (by Visual Basic standards) counter for loop is also simple.

Example 6.9. Simple Counters

>>> for i in range(5):             (1)
...     print i
0
1
2
3
4
>>> li = ['a', 'b', 'c', 'd', 'e']
>>> for i in range(len(li)):       (2)
...     print li[i]
a
b
c
d
e
  1. As you saw in Example 3.20, “Assigning Consecutive Values”, range produces a list of integers, which you then loop through. I know it looks a bit odd, but it is occasionally (and I stress occasionally) useful to have a counter loop.
  2. Don’t ever do this. This is Visual Basic-style thinking. Break out of it. Just iterate through the list, as shown in the previous example.
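
If you genuinely need the position as well as the value, one alternative to range(len(li)) is the built-in enumerate function. It is not used elsewhere in this chapter and requires Python 2.3 or later, so treat this as a side note rather than part of the book’s sample code.

```python
li = ['a', 'b', 'c', 'd', 'e']

labeled = []
for i, s in enumerate(li):      # i is the index, s is the element
    labeled.append("%d: %s" % (i, s))

print("\n".join(labeled))
```

You still iterate through the list directly; the counter just comes along for the ride.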

for loops are not just for simple counters. They can iterate through all kinds of things. Here is an example of using a for loop to iterate through a dictionary.

Example 6.10. Iterating Through a Dictionary

>>> import os
>>> for k, v in os.environ.items():      (1) (2)
...     print "%s=%s" % (k, v)
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim

[...snip...]

>>> print "\n".join(["%s=%s" % (k, v)
...     for k, v in os.environ.items()]) (3)
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim

[...snip...]

  1. os.environ is a dictionary of the environment variables defined on your system. In Windows, these are your user and system variables accessible from MS-DOS. In UNIX, they are the variables exported in your shell’s startup scripts. In Mac OS, there is no concept of environment variables, so this dictionary is empty.
  2. os.environ.items() returns a list of tuples: [(key1, value1), (key2, value2), ...]. The for loop iterates through this list. The first round, it assigns key1 to k and value1 to v, so k = USERPROFILE and v = C:\Documents and Settings\mpilgrim. In the second round, k gets the second key, OS, and v gets the corresponding value, Windows_NT.
  3. With multi-variable assignment and list comprehensions, you can replace the entire for loop with a single statement. Whether you actually do this in real code is a matter of personal coding style. I like it because it makes it clear that what I’m doing is mapping a dictionary into a list, then joining the list into a single string. Other programmers prefer to write this out as a for loop. The output is the same in either case, although this version is slightly faster, because there is only one print statement instead of many.
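
Since os.environ differs from machine to machine, here is a sketch that compares the two styles on a small hand-made dictionary. The dictionary env is an assumption standing in for os.environ, and sorted (Python 2.4 or later) is used only to make the order predictable.

```python
env = {"USERNAME": "mpilgrim", "OS": "Windows_NT", "COMPUTERNAME": "MPILGRIM"}

# The explicit for loop, collecting one formatted line per item.
lines = []
for k, v in sorted(env.items()):
    lines.append("%s=%s" % (k, v))
looped = "\n".join(lines)

# The single-statement version: map the dictionary into a list, then join it.
joined = "\n".join(["%s=%s" % (k, v) for k, v in sorted(env.items())])

print(joined)
```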

Now we can look at the for loop in MP3FileInfo, from the sample fileinfo.py program introduced in Chapter 5.

Example 6.11. for Loop in MP3FileInfo

tagDataMap = {"title"   : (  3,  33, stripnulls),
              "artist"  : ( 33,  63, stripnulls),
              "album"   : ( 63,  93, stripnulls),
              "year"    : ( 93,  97, stripnulls),
              "comment" : ( 97, 126, stripnulls),
              "genre"   : (127, 128, ord)}                               (1)
.
.
.
        if tagdata[:3] == "TAG":
            for tag, (start, end, parseFunc) in self.tagDataMap.items(): (2)
                self[tag] = parseFunc(tagdata[start:end])                (3)
  1. tagDataMap is a class attribute that defines the tags you’re looking for in an MP3 file. Tags are stored in fixed-length fields. Once you read the last 128 bytes of the file, bytes 3 through 32 of those are always the song title, 33 through 62 are always the artist name, 63 through 92 are the album name, and so forth. Note that tagDataMap is a dictionary of tuples, and each tuple contains two integers and a function reference.
  2. This looks complicated, but it’s not. The structure of the for variables matches the structure of the elements of the list returned by items. Remember that items returns a list of tuples of the form (key, value). The first element of that list is (“title”, (3, 33, <function stripnulls>)), so the first time around the loop, tag gets “title”, start gets 3, end gets 33, and parseFunc gets the function stripnulls.
  3. Now that you’ve extracted all the parameters for a single MP3 tag, saving the tag data is easy. You slice tagdata from start to end to get the actual data for this tag, call parseFunc to post-process the data, and assign this as the value for the key tag in the pseudo-dictionary self. After iterating through all the elements in tagDataMap, self has the values for all the tags, and you know what that looks like.
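
To watch this loop work without an MP3 file, you can feed it a hand-built 128-character tag. Several things in this sketch are assumptions: stripnulls is a stand-in for the helper defined in fileinfo.py, pad is a made-up convenience function, the result goes into a plain dictionary named info instead of self, and the tag is an ordinary text string rather than bytes read from disk.

```python
def stripnulls(data):
    "strip whitespace and nulls (stand-in for the helper in fileinfo.py)"
    return data.replace("\0", "").strip()

tagDataMap = {"title"   : (  3,  33, stripnulls),
              "artist"  : ( 33,  63, stripnulls),
              "album"   : ( 63,  93, stripnulls),
              "year"    : ( 93,  97, stripnulls),
              "comment" : ( 97, 126, stripnulls),
              "genre"   : (127, 128, ord)}

def pad(text, width):
    "pad text with null characters to a fixed width"
    return text.ljust(width, "\0")

# 3 + 30 + 30 + 30 + 4 + 30 + 1 = 128 characters, laid out in fixed-length fields.
tagdata = ("TAG" + pad("KAIRO", 30) + pad("DJ MARY-JANE", 30) +
           pad("THE BEST GOA", 30) + "2000" + pad("Rave Mix", 30) + chr(31))

info = {}
if tagdata[:3] == "TAG":
    for tag, (start, end, parseFunc) in tagDataMap.items():
        info[tag] = parseFunc(tagdata[start:end])

print(info["title"])
```

Each pass through the loop unpacks one key and one three-element tuple, slices out the field, and hands it to whichever function that tuple names.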

6.4. Using sys.modules

Modules, like everything else in Python, are objects. Once a module has been imported, you can always get a reference to it through the global dictionary sys.modules.

Example 6.12. Introducing sys.modules

>>> import sys                          (1)
>>> print '\n'.join(sys.modules.keys()) (2)
win32api
os.path
os
exceptions
__main__
ntpath
nt
sys
__builtin__
site
signal
UserDict
stat
  1. The sys module contains system-level information, such as the version of Python you’re running (sys.version or sys.version_info), and system-level options such as the maximum allowed recursion depth (sys.getrecursionlimit() and sys.setrecursionlimit()).
  2. sys.modules is a dictionary containing all the modules that have ever been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules your program has imported. Python preloads some modules on startup, and if you’re using a Python IDE, sys.modules contains all the modules imported by all the programs you’ve run within the IDE.

This example demonstrates how to use sys.modules.

Example 6.13. Using sys.modules

>>> import fileinfo         (1)
>>> print '\n'.join(sys.modules.keys())
win32api
os.path
os
fileinfo
exceptions
__main__
ntpath
nt
sys
__builtin__
site
signal
UserDict
stat
>>> fileinfo
<module 'fileinfo' from 'fileinfo.pyc'>
>>> sys.modules["fileinfo"] (2)
<module 'fileinfo' from 'fileinfo.pyc'>
  1. As new modules are imported, they are added to sys.modules. This explains why importing the same module twice is very fast: Python has already loaded and cached the module in sys.modules, so importing the second time is simply a dictionary lookup.
  2. Given the name (as a string) of any previously-imported module, you can get a reference to the module itself through the sys.modules dictionary.
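
You can confirm the caching behavior with a module that is always available. This sketch uses os instead of fileinfo so that it runs anywhere; the variable names are made up.

```python
import sys
import os

cached = sys.modules["os"]      # look the module up by its name, as a string
import os as os_again           # a repeated import reuses the cached object

same_object = (cached is os) and (os_again is os)
has_path = "os.path" in sys.modules

print(same_object)
```

Both names end up bound to the very same module object, which is why the second import costs little more than a dictionary lookup.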

The next example shows how to use the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a class is defined.

Example 6.14. The __module__ Class Attribute

>>> from fileinfo import MP3FileInfo
>>> MP3FileInfo.__module__              (1)
'fileinfo'
>>> sys.modules[MP3FileInfo.__module__] (2)
<module 'fileinfo' from 'fileinfo.pyc'>
  1. Every Python class has a built-in class attribute __module__, which is the name of the module in which the class is defined.
  2. Combining this with the sys.modules dictionary, you can get a reference to the module in which a class is defined.
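
The same lookup works for any class. As a sketch that doesn’t depend on fileinfo.py, the following uses the Template class from the standard string module (chosen here only because it is available everywhere).

```python
import sys
import string
from string import Template

module_name = Template.__module__       # the name of the defining module
module = sys.modules[module_name]       # the module object itself

print(module_name)
```

Starting from nothing but the class, you end up holding a reference to the module that defines it.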

Now you’re ready to see how sys.modules is used in fileinfo.py, the sample program introduced in Chapter 5. This example shows that portion of the code.

Example 6.15. sys.modules in fileinfo.py

def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):       (1)
    "get file info class from filename extension"
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]        (2)
    return hasattr(module, subclass) and getattr(module, subclass) or FileInfo (3)
  1. This is a function with two arguments; filename is required, but module is optional and defaults to the module that contains the FileInfo class. This looks inefficient, because you might expect Python to evaluate the sys.modules expression every time the function is called. In fact, Python evaluates default expressions only once, the first time the module is imported. As you’ll see later, you never call this function with a module argument, so module serves as a function-level constant.
  2. You’ll plow through this line later, after you dive into the os module. For now, take it on faith that subclass ends up as the name of a class, like MP3FileInfo.
  3. You already know about getattr, which gets a reference to an object by name. hasattr is a complementary function that checks whether an object has a particular attribute; in this case, whether a module has a particular class (although it works for any object and any attribute, just like getattr). In English, this line of code says, “If this module has the class named by subclass then return it, otherwise return the base class FileInfo.”
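
Here is a self-contained sketch of both ideas: the default argument that is evaluated only once, and the hasattr/getattr lookup. Everything except the two lines inside the function is an assumption made for illustration: the empty FileInfo and MP3FileInfo classes, the demo_module object built with types.ModuleType to stand in for the fileinfo module, and the pick_default helper that records how often the default expression runs.

```python
import os
import types

class FileInfo(object):
    "stand-in for the base class in fileinfo.py"
    pass

class MP3FileInfo(FileInfo):
    "stand-in for the MP3-specific subclass"
    pass

# A real module object holding the two classes, playing the part of fileinfo.
demo_module = types.ModuleType("demo_fileinfo")
demo_module.FileInfo = FileInfo
demo_module.MP3FileInfo = MP3FileInfo

default_evaluations = []

def pick_default():
    "record each evaluation of the default expression"
    default_evaluations.append(1)
    return demo_module

def get_file_info_class(filename, module=pick_default()):
    "get file info class from filename extension"
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
    return hasattr(module, subclass) and getattr(module, subclass) or FileInfo

mp3_class = get_file_info_class("kairo.mp3")    # demo_module has MP3FileInfo
txt_class = get_file_info_class("notes.txt")    # no TXTFileInfo, so fall back

print(len(default_evaluations))
```

Although the function is called twice, pick_default runs only once, when the def statement is executed.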

Further Reading on Modules

6.5. Working with Directories

The os.path module has several functions for manipulating files and directories. Here, we’re looking at handling pathnames and listing the contents of a directory.

Example 6.16. Constructing Pathnames

>>> import os
>>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3") (1) (2)
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.join("c:\\music\\ap", "mahadeva.mp3")   (3)
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.expanduser("~")                         (4)
'c:\\Documents and Settings\\mpilgrim\\My Documents'
>>> os.path.join(os.path.expanduser("~"), "Python") (5)
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'
  1. os.path is a reference to a module – which module depends on your platform. Just as getpass encapsulates differences between platforms by setting getpass to a platform-specific function, os encapsulates differences between platforms by setting path to a platform-specific module.
  2. The join function of os.path constructs a pathname out of one or more partial pathnames. In this case, it simply concatenates strings. (Note that dealing with pathnames on Windows is annoying because the backslash character must be escaped.)
  3. In this slightly less trivial case, join will add an extra backslash to the pathname before joining it to the filename. I was overjoyed when I discovered this, since addSlashIfNecessary is one of the stupid little functions I always need to write when building up my toolbox in a new language. Do not write this stupid little function in Python; smart people have already taken care of it for you.
  4. expanduser will expand a pathname that uses ~ to represent the current user’s home directory. This works on any platform where users have a home directory, like Windows, UNIX, and Mac OS X; it has no effect on Mac OS.
  5. Combining these techniques, you can easily construct pathnames for directories and files under the user’s home directory.
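
The platform-specific modules behind os.path can also be imported directly, which makes it possible to try the Windows behavior on any machine. This sketch assumes a Windows or UNIX-like system, where os.path is either ntpath or posixpath; other platforms may bind a different module.

```python
import ntpath
import posixpath
import os

# ntpath always follows Windows rules, whatever platform you run it on.
win_a = ntpath.join("c:\\music\\ap\\", "mahadeva.mp3")
win_b = ntpath.join("c:\\music\\ap", "mahadeva.mp3")    # separator added for you

# posixpath always follows UNIX rules.
unix = posixpath.join("/music/ap", "mahadeva.mp3")

# os.path is simply whichever of these modules matches the current platform.
platform_module = os.path.__name__

print(platform_module)
```

In your own code, stick with os.path; importing ntpath or posixpath directly is useful mainly for experiments like this one.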

Example 6.17. Splitting Pathnames

>>> os.path.split("c:\\music\\ap\\mahadeva.mp3")                        (1)
('c:\\music\\ap', 'mahadeva.mp3')
>>> (filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3") (2)
>>> filepath                                                            (3)
'c:\\music\\ap'
>>> filename                                                            (4)
'mahadeva.mp3'
>>> (shortname, extension) = os.path.splitext(filename)                 (5)
>>> shortname
'mahadeva'
>>> extension
'.mp3'
  1. The split function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use multi-variable assignment to return multiple values from a function? Well, split is such a function.
  2. You assign the return value of the split function into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple.
  3. The first variable, filepath, receives the value of the first element of the tuple returned from split, the file path.
  4. The second variable, filename, receives the value of the second element of the tuple returned from split, the filename.
  5. os.path also contains a function splitext, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique to assign each of them to separate variables.
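
With splitext in hand, you can now unpack the line from getFileInfoClass that builds a class name from a file extension. The function name subclass_name is made up for this sketch; the expression inside it is the one from Example 6.15, split into two steps.

```python
import os

def subclass_name(filename):
    "build a class name such as MP3FileInfo from the extension of filename"
    extension = os.path.splitext(filename)[1]       # for example, '.mp3'
    return "%sFileInfo" % extension.upper()[1:]     # '.MP3'[1:] is 'MP3'

print(subclass_name("mahadeva.mp3"))
```

A filename with no extension yields an empty string for the extension, so the result is just "FileInfo", the name of the base class.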

Example 6.18. Listing Directories

>>> os.listdir("c:\\music\\_singles\\")              (1)
['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
'spinning.mp3']
>>> dirname = "c:\\"
>>> os.listdir(dirname)                              (2)
['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'cygwin',
'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'IO.SYS',
'MSDOS.SYS', 'Music', 'NTDETECT.COM', 'ntldr', 'pagefile.sys',
'Program Files', 'Python20', 'RECYCLER',
'System Volume Information', 'TEMP', 'WINNT']
>>> [f for f in os.listdir(dirname)
...     if os.path.isfile(os.path.join(dirname, f))] (3)
['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS',
'NTDETECT.COM', 'ntldr', 'pagefile.sys']
>>> [f for f in os.listdir(dirname)
...     if os.path.isdir(os.path.join(dirname, f))]  (4)
['cygwin', 'docbook', 'Documents and Settings', 'Incoming',
'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER',
'System Volume Information', 'TEMP', 'WINNT']
  1. The listdir function takes a pathname and returns a list of the contents of the directory.
  2. listdir returns both files and folders, with no indication of which is which.
  3. You can use list filtering and the isfile function of the os.path module to separate the files from the folders. isfile takes a pathname and returns a true value if the path represents a file, and a false value otherwise (1 and 0 in older versions of Python, True and False in newer ones). Here you’re using os.path.join to ensure a full pathname, but isfile also works with a partial path, relative to the current working directory. You can use os.getcwd() to get the current working directory.
  4. os.path also has an isdir function, which returns a true value if the path represents a directory, and a false value otherwise. You can use this to get a list of the subdirectories within a directory.
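
The Windows paths above will not exist on most machines, so here is a minimal sketch of the same filtering technique that builds its own scratch directory first. The function name split_entries and the file names are made up for illustration; only os.listdir, os.path.isfile, os.path.isdir, and os.path.join come from the text.

```python
import os
import tempfile

def split_entries(dirname):
    "return (files, dirs) for the immediate contents of dirname, each sorted"
    entries = os.listdir(dirname)
    # Join each bare name back onto dirname so the test does not depend
    # on the current working directory.
    files = [f for f in entries if os.path.isfile(os.path.join(dirname, f))]
    dirs = [d for d in entries if os.path.isdir(os.path.join(dirname, d))]
    return sorted(files), sorted(dirs)

# Build a small throwaway directory so the example runs anywhere.
scratch = tempfile.mkdtemp()
os.mkdir(os.path.join(scratch, "music"))
open(os.path.join(scratch, "notes.txt"), "w").close()

files, dirs = split_entries(scratch)
print(files)  # ['notes.txt']
print(dirs)   # ['music']
```

The results are sorted because listdir makes no promise about the order of the names it returns.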

Example 6.19. Listing Directories in fileinfo.py

def listDirectory(directory, fileExtList):
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]            (1) (2)
    fileList = [os.path.join(directory, f)
               for f in fileList
                if os.path.splitext(f)[1] in fileExtList]  (3) (4) (5)
  1. os.listdir(directory) returns a list of all the files and folders in directory.

  2. Iterating through the list with f, you use os.path.normcase(f) to normalize the case according to operating system defaults. normcase is a useful little function that compensates for case-insensitive operating systems that think that mahadeva.mp3 and mahadeva.MP3 are the same file. For instance, on Windows and Mac OS, normcase will convert the entire filename to lowercase; on UNIX-compatible systems, it will return the filename unchanged.

  3. Iterating through the normalized list with f again, you use os.path.splitext(f) to split each filename into name and extension.

  4. For each file, you see if the extension is in the list of file extensions you care about (fileExtList, which was passed to the listDirectory function).

  5. For each file you care about, you use os.path.join(directory, f) to construct the full pathname of the file, and return a list of the full pathnames.

    Note: Whenever possible, you should use the functions in os and os.path for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like os.path.split work on UNIX, Windows, Mac OS, and any other platform supported by Python.
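
The five steps above can be tried on any platform with a sketch like the following. It is not the fileinfo.py code itself: the name list_by_extension and the sample file names are hypothetical, and the result is sorted only to make the output predictable.

```python
import os
import tempfile

def list_by_extension(directory, ext_list):
    "return full paths of entries in directory whose extension is in ext_list"
    # Steps 1 and 2: list the directory and normalize the case of each name.
    names = [os.path.normcase(f) for f in os.listdir(directory)]
    # Steps 3 to 5: keep the names with an interesting extension and
    # rebuild the full pathname of each one.
    return sorted(os.path.join(directory, f)
                  for f in names
                  if os.path.splitext(f)[1] in ext_list)

# Throwaway directory with a few empty files (names are made up).
scratch = tempfile.mkdtemp()
for name in ("kairo.mp3", "cover.jpg", "readme.txt"):
    open(os.path.join(scratch, name), "w").close()

matches = list_by_extension(scratch, [".mp3", ".jpg"])
print([os.path.basename(p) for p in matches])  # ['cover.jpg', 'kairo.mp3']
```

Note that, like the original, this sketch filters on the name alone, so a folder whose name ends in .mp3 would also be returned.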

There is one other way to get the contents of a directory. It’s very powerful, and it uses the sort of wildcards that you may already be familiar with from working on the command line.

Example 6.20. Listing Directories with glob

>>> os.listdir("c:\\music\\_singles\\")               (1)
['a_time_long_forgotten_con.mp3', 'hellraiser.mp3',
'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3',
'spinning.mp3']
>>> import glob
>>> glob.glob('c:\\music\\_singles\\*.mp3')           (2)
['c:\\music\\_singles\\a_time_long_forgotten_con.mp3',
'c:\\music\\_singles\\hellraiser.mp3',
'c:\\music\\_singles\\kairo.mp3',
'c:\\music\\_singles\\long_way_home1.mp3',
'c:\\music\\_singles\\sidewinder.mp3',
'c:\\music\\_singles\\spinning.mp3']
>>> glob.glob('c:\\music\\_singles\\s*.mp3')          (3)
['c:\\music\\_singles\\sidewinder.mp3',
'c:\\music\\_singles\\spinning.mp3']
>>> glob.glob('c:\\music\\*\\*.mp3')                  (4)
  1. As you saw earlier, os.listdir simply takes a directory path and lists all files and directories in that directory.

  2. The glob module, on the other hand, takes a wildcard and returns the full path of all files and directories matching the wildcard. Here the wildcard is a directory path plus “*.mp3”, which will match all .mp3 files. Note that each element of the returned list already includes the full path of the file.

  3. If you want to find all the files in a specific directory that start with “s” and end with “.mp3”, you can do that too.

  4. Now consider this scenario: you have a music directory, with several subdirectories within it, with .mp3 files within each subdirectory. You can get a list of all of those with a single call to glob, by using two wildcards at once. One wildcard is the “*.mp3” (to match .mp3 files), and one wildcard is within the directory path itself, to match any subdirectory within c:\music. That’s a crazy amount of power packed into one deceptively simple-looking function!
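
To watch both kinds of wildcard at work without a c:\music directory, here is a small sketch that creates a scratch tree and globs over it. The folder and file names are invented for the example; only glob.glob and the wildcard patterns come from the text.

```python
import glob
import os
import tempfile

# Build a scratch "music" tree: two subdirectories holding a few empty files.
root = tempfile.mkdtemp()
for subdir, name in [("ap", "mahadeva.mp3"), ("_singles", "kairo.mp3"),
                     ("_singles", "spinning.mp3"), ("_singles", "cover.jpg")]:
    folder = os.path.join(root, subdir)
    if not os.path.isdir(folder):
        os.mkdir(folder)
    open(os.path.join(folder, name), "w").close()

# One wildcard in the filename: .mp3 files starting with "s" in one directory.
one_dir = glob.glob(os.path.join(root, "_singles", "s*.mp3"))

# Two wildcards: any subdirectory of root, any .mp3 file inside it.
all_mp3 = glob.glob(os.path.join(root, "*", "*.mp3"))

print([os.path.basename(p) for p in one_dir])        # ['spinning.mp3']
print(sorted(os.path.basename(p) for p in all_mp3))  # ['kairo.mp3', 'mahadeva.mp3', 'spinning.mp3']
```

The second result is sorted before printing because glob does not guarantee the order of its matches.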

6.6. Putting It All Together

Once again, all the dominoes are in place. You’ve seen how each line of code works. Now let’s step back and see how it all fits together.

Example 6.21. listDirectory

def listDirectory(directory, fileExtList):                                         (1)
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]
    fileList = [os.path.join(directory, f)
               for f in fileList
                if os.path.splitext(f)[1] in fileExtList]                          (2)
    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):       (3)
        "get file info class from filename extension"
        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]        (4)
        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo (5)
    return [getFileInfoClass(f)(f) for f in fileList]                              (6)
  1. listDirectory is the main attraction of this entire module. It takes a directory (like c:\music\_singles\ in my case) and a list of interesting file extensions (like ['.mp3']), and it returns a list of class instances that act like dictionaries that contain metadata about each interesting file in that directory. And it does it in just a few straightforward lines of code.
  2. As you saw in the previous section, this line of code gets a list of the full pathnames of all the files in directory that have an interesting file extension (as specified by fileExtList).
  3. Old-school Pascal programmers may be familiar with them, but most people give me a blank stare when I tell them that Python supports nested functions – literally, a function within a function. The nested function getFileInfoClass can be called only from the function in which it is defined, listDirectory. As with any other function, you don’t need an interface declaration or anything fancy; just define the function and code it.
  4. Now that you’ve seen the os module, this line should make more sense. It gets the extension of the file (os.path.splitext(filename)[1]), forces it to uppercase (.upper()), slices off the dot ([1:]), and constructs a class name out of it with string formatting. So c:\music\ap\mahadeva.mp3 becomes .mp3 becomes .MP3 becomes MP3 becomes MP3FileInfo.
  5. Having constructed the name of the handler class that would handle this file, you check to see if that handler class actually exists in this module. If it does, you return the class, otherwise you return the base class FileInfo. This is a very important point: this function returns a class. Not an instance of a class, but the class itself.
  6. For each file in the “interesting files” list (fileList), you call getFileInfoClass with the filename (f). Calling getFileInfoClass(f) returns a class; you don’t know exactly which class, but you don’t care. You then create an instance of this class (whatever it is) and pass the filename (f again) to the __init__ method. As you saw earlier in this chapter, the __init__ method of FileInfo sets self["name"], which triggers __setitem__, which is overridden in the descendant (MP3FileInfo) to parse the file appropriately to pull out the file’s metadata. You do all that for each interesting file and return a list of the resulting instances.

Note that listDirectory is completely generic. It doesn’t know ahead of time which types of files it will be getting, or which classes are defined that could potentially handle those files. It inspects the directory for the files to process, and then introspects its own module to see what special handler classes (like MP3FileInfo) are defined. You can extend this program to handle other types of files simply by defining an appropriately-named class: HTMLFileInfo for HTML files, DOCFileInfo for Word .doc files, and so forth. listDirectory will handle them all, without modification, by handing off the real work to the appropriate classes and collating the results.
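
The class lookup is easier to experiment with in isolation. The sketch below is a simplified stand-in, not the fileinfo.py code: FileInfo and MP3FileInfo are cut down to almost nothing, the handlers object is a hand-built module that plays the role of sys.modules[FileInfo.__module__], and the three-argument form of getattr is swapped in for the and-or expression (for ordinary classes the two behave the same way).

```python
import os
import types

class FileInfo(dict):
    "simplified stand-in for the chapter's FileInfo"
    def __init__(self, filename=None):
        dict.__init__(self)
        self["name"] = filename

class MP3FileInfo(FileInfo):
    "stand-in handler; the real class also parses ID3 tags"

# Stand-in for the fileinfo module, built by hand so the sketch is self-contained.
handlers = types.ModuleType("handlers")
handlers.FileInfo = FileInfo
handlers.MP3FileInfo = MP3FileInfo

def get_file_info_class(filename, module=handlers):
    "map a filename extension to a handler class, falling back to FileInfo"
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
    # Three-argument getattr returns the default when the attribute is missing.
    return getattr(module, subclass, FileInfo)

names = ["mahadeva.mp3", "index.html"]
# The function returns a class; calling that class creates the instance.
infos = [get_file_info_class(f)(f) for f in names]
print([type(i).__name__ for i in infos])  # ['MP3FileInfo', 'FileInfo']
```

Adding a class named HTMLFileInfo to handlers would change the second result without touching get_file_info_class, which is the extension mechanism described above.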

6.7. Summary

The fileinfo.py program introduced in Chapter 5 should now make perfect sense.

"""Framework for getting filetype-specific metadata.

Instantiate appropriate class with filename.  Returned object acts like a
dictionary, with key-value pairs for each piece of metadata.
    import fileinfo
    info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3")
    print "\\n".join(["%s=%s" % (k, v) for k, v in info.items()])

Or use listDirectory function to get info on all files in a directory.
    for info in fileinfo.listDirectory("/music/ap/", [".mp3"]):
        ...

Framework can be extended by adding classes for particular file types, e.g.
HTMLFileInfo, MPGFileInfo, DOCFileInfo.  Each class is completely responsible for
parsing its files appropriately; see MP3FileInfo for example.
"""
import os
import sys
from UserDict import UserDict

def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\00", "").strip()

class FileInfo(UserDict):
    "store file metadata"
    def __init__(self, filename=None):
        UserDict.__init__(self)
        self["name"] = filename

class MP3FileInfo(FileInfo):
    "store ID3v1.0 MP3 tags"
    tagDataMap = {"title"   : (  3,  33, stripnulls),
                  "artist"  : ( 33,  63, stripnulls),
                  "album"   : ( 63,  93, stripnulls),
                  "year"    : ( 93,  97, stripnulls),
                  "comment" : ( 97, 126, stripnulls),
                  "genre"   : (127, 128, ord)}

    def __parse(self, filename):
        "parse ID3v1.0 tags from MP3 file"
        self.clear()
        try:
            fsock = open(filename, "rb", 0)
            try:
                fsock.seek(-128, 2)
                tagdata = fsock.read(128)
            finally:
                fsock.close()
            if tagdata[:3] == "TAG":
                for tag, (start, end, parseFunc) in self.tagDataMap.items():
                    self[tag] = parseFunc(tagdata[start:end])
        except IOError:
            pass

    def __setitem__(self, key, item):
        if key == "name" and item:
            self.__parse(item)
        FileInfo.__setitem__(self, key, item)

def listDirectory(directory, fileExtList):
    "get list of file info objects for files of particular extensions"
    fileList = [os.path.normcase(f)
                for f in os.listdir(directory)]
    fileList = [os.path.join(directory, f)
               for f in fileList
                if os.path.splitext(f)[1] in fileExtList]
    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):
        "get file info class from filename extension"
        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo
    return [getFileInfoClass(f)(f) for f in fileList]

if __name__ == "__main__":
    for info in listDirectory("/music/_singles/", [".mp3"]):
        print "\n".join(["%s=%s" % (k, v) for k, v in info.items()])
        print
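
If you have no MP3 files at hand, the tag-parsing logic can still be exercised against an in-memory file. The following is a rough sketch written for Python 3, where file data is bytes, so it differs from the Python 2 listing above: the helpers text, field, and parse_id3v1 are hypothetical names, the tag values are invented, and the fields are decoded as Latin-1 purely as an assumption. The byte offsets are copied from tagDataMap.

```python
import io

def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\x00", "").strip()

def text(raw):
    "decode a raw tag field (assumed Latin-1 here) and clean it up"
    return stripnulls(raw.decode("latin-1"))

# Same offsets as tagDataMap in the listing above.
TAG_DATA_MAP = {"title":   (3, 33, text),
                "artist":  (33, 63, text),
                "album":   (63, 93, text),
                "year":    (93, 97, text),
                "comment": (97, 126, text),
                "genre":   (127, 128, ord)}

def parse_id3v1(fsock):
    "return a dict of tag fields from a binary file object, or {} if untagged"
    fsock.seek(-128, 2)        # whence=2: count back from the end of the file
    tagdata = fsock.read(128)
    info = {}
    if tagdata[:3] == b"TAG":
        # Each pass unpacks the key and the (start, end, function) tuple at once.
        for tag, (start, end, parse_func) in TAG_DATA_MAP.items():
            info[tag] = parse_func(tagdata[start:end])
    return info

def field(value, width):
    "pad a string with nulls to a fixed-width bytes field"
    return value.encode("latin-1").ljust(width, b"\x00")

# A made-up 128-byte tag appended to some fake audio data.
tag = (b"TAG" + field("Example Title", 30) + field("Example Artist", 30) +
       field("Example Album", 30) + field("2001", 4) + field("demo", 30) +
       b"\x1f")
fake_mp3 = io.BytesIO(b"not really audio" + tag)

info = parse_id3v1(fake_mp3)
print(sorted(info.items()))
```

The for loop in parse_id3v1 is the same multi-variable assignment used in the __parse method: one statement binds tag, start, end, and the parsing function on every iteration.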

Before diving into the next chapter, make sure you’re comfortable doing the following things:

  • Catching exceptions with try...except
  • Protecting external resources with try...finally
  • Reading from files
  • Assigning multiple values at once in a for loop
  • Using the os module for all your cross-platform file manipulation needs
  • Dynamically instantiating classes of unknown type by treating classes as objects and passing them around