A Simple Directory Walker

The Problem

In Python, I often need to traverse a directory recursively and act on the files in some way. The standard solution is os.walk, but this function has three problems:

  1. It yields a tuple of three elements, and I often don’t remember the order, so I have to look it up
  2. It does not give the full path to the file. I always have to call os.path.join to construct it
  3. It yields a list of file names for each directory, which requires a second loop nested inside the first

Here is an example:

import os

# root is the directory to traverse
for dirpath, dirnames, filenames in os.walk(root):
    for filename in filenames:
        fullpath = os.path.join(dirpath, filename)
        print(fullpath)  # do something with fullpath

The Solution

What I really want is a simple function that takes a directory and yields the path to each file under it, one at a time:

for fullpath in dirwalker(root):
    print(fullpath)  # do something with fullpath

Implementing the dirwalker function is not that hard:

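import os

def dirwalker(root):
    """Yield the path of every file under root, one at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            # dirpath already starts with root, so the result is root-prefixed
            yield os.path.join(dirpath, filename)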

Discussion

The dirwalker function is just a thin wrapper around os.walk, but it solves the three stated problems. First, it yields plain path names instead of three-element tuples, so there is no order to remember. Second, each path it yields is already joined with its directory and starts with root, so I never have to call os.path.join myself. Finally, it eliminates the nested loop, which simplifies the calling code and improves readability.
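To make the shape of the yielded paths concrete, here is a minimal sketch run against a throwaway directory. The names relative_paths and make_sample_tree are hypothetical helpers added only for this illustration. It assumes you sometimes want the root prefix stripped, which os.path.relpath can do.

```python
import os
import tempfile


def dirwalker(root):
    """Yield the path of every file under root (same logic as in the post)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            yield os.path.join(dirpath, filename)


def relative_paths(root):
    """Hypothetical helper: yield each path with the root prefix removed."""
    for fullpath in dirwalker(root):
        yield os.path.relpath(fullpath, root)


def make_sample_tree(root):
    """Illustration only: create a.txt and sub/b.txt under root."""
    os.makedirs(os.path.join(root, "sub"))
    for name in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, name), "w") as f:
            f.write("sample")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_sample_tree(root)
        for path in dirwalker(root):
            print(path)  # every path begins with the root that was passed in
        for path in relative_paths(root):
            print(path)  # same files, with the root prefix stripped
```

If root is an absolute path, the paths from dirwalker are absolute too. If root is relative, they are relative to the current working directory.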

I made dirwalker a generator instead of a normal function for three reasons. First, a generator delivers results sooner because it yields each path name as soon as it constructs one. The caller does not have to wait for dirwalker to finish traversing all the sub-directories before receiving the first path. Second, dirwalker does not need to store all the path names in a list before returning to the caller, which saves memory. Finally, the calling code sometimes wants to break out of the loop based on some condition. A normal function would have to traverse all of the directories anyway, even if the caller decides to break out early. Since a generator only produces output on demand, it does not have this problem.
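The early-exit point can be sketched as follows. The helper first_with_suffix is a hypothetical name used only for this example. It stops consuming the generator at the first match, so directories that have not been reached yet are never listed. itertools.islice is another way to take only the first few paths.

```python
import itertools
import os
import tempfile


def dirwalker(root):
    """Yield the path of every file under root (same logic as in the post)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            yield os.path.join(dirpath, filename)


def first_with_suffix(root, suffix):
    """Hypothetical helper: return the first path ending in suffix, or None."""
    for fullpath in dirwalker(root):
        if fullpath.endswith(suffix):
            # Returning here abandons the generator; the walk goes no further.
            return fullpath
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Illustration only: a handful of sample files.
        for i in range(5):
            with open(os.path.join(root, f"file{i}.txt"), "w") as f:
                f.write("sample")

        print(first_with_suffix(root, ".txt"))

        # Take at most two paths without walking the whole tree.
        for path in itertools.islice(dirwalker(root), 2):
            print(path)
```

A list-returning version would need to finish the entire traversal before either call could see its first path.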

A pattern I often encounter while gathering files is including or excluding those that match a set of patterns. In the next post, I will introduce a new feature to dirwalker: filtering.

Conclusion

Gathering files using os.walk is not that hard, but it has its annoyances. That’s the reason I wrote dirwalker. I believe dirwalker can make your code simpler and more Pythonic. Give it a try.
