In Python, I often need to traverse a directory recursively and act on the files in some way. The usual solution is os.walk, but this function has three problems:
- It yields a tuple of three elements, and I often don’t remember their order, so I have to look it up
- It does not return the full path to the file. I always have to call os.path.join to construct the full path
- It returns a list of file names, which requires another loop. That means a nested loop
Here is an example:
    import os

    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            fullpath = os.path.join(dirpath, filename)
            # do something with fullpath
What I really want is a simple function that takes a directory and yields the full path of every file under it:
    for fullpath in dirwalker(root):
        print(fullpath)  # do something with fullpath
The dirwalker function is not that hard to write:
    import os

    def dirwalker(root):
        # Yield the full path of every file under root, one at a time.
        for dirpath, dirnames, filenames in os.walk(root):
            for filename in filenames:
                fullpath = os.path.join(dirpath, filename)
                yield fullpath
The dirwalker function is just a thin shell on top of os.walk, but it solves the three stated problems. First, it yields a single path name at a time instead of a three-element tuple, so there is no order to remember. Second, it yields the complete path, starting at root, rather than a bare file name, which is what I almost always need. Finally, it eliminates the nested loop, which simplifies the calling code and improves readability.
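Because dirwalker produces a flat stream of paths, it also plugs straight into built-ins such as sum and sorted. Here is a minimal sketch of that idea. The helpers total_size and python_sources are hypothetical names used only for illustration, and the sketch assumes every file is still present and readable when its size is queried.

```python
import os


def dirwalker(root):
    """Yield the full path of every file under root (same logic as above)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            yield os.path.join(dirpath, filename)


def total_size(root):
    """Hypothetical helper: total size in bytes of all files under root."""
    # Assumes each path still exists; os.path.getsize raises OSError otherwise.
    return sum(os.path.getsize(path) for path in dirwalker(root))


def python_sources(root):
    """Hypothetical helper: sorted list of paths under root ending in .py."""
    return sorted(path for path in dirwalker(root) if path.endswith(".py"))
```

With plain os.walk, each of these helpers would need its own nested loop. Here the traversal is written once and the callers stay on a single line.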
I made dirwalker a generator instead of a normal function for three reasons. First, a generator is more responsive: it yields a path name as soon as it has constructed one, so the caller does not have to wait for dirwalker to finish traversing all the sub-directories before receiving the first result. Second, dirwalker does not need to store all the path names in a list before returning to the caller, which saves memory. Finally, the calling code sometimes wants to break out of the loop based on some condition. A normal function would have to traverse all of the directories anyway, even if the caller decides to break out early. Since a generator only produces output on demand, it does not have this problem.
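The early-exit benefit is easiest to see in code. The sketch below is only an illustration: first_match and first_n are hypothetical helper names, and itertools.islice is one of several ways to take a bounded number of items from a generator.

```python
import os
from itertools import islice


def dirwalker(root):
    """Yield the full path of every file under root (same logic as above)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            yield os.path.join(dirpath, filename)


def first_match(root, suffix):
    """Hypothetical helper: first path ending in suffix, or None if absent."""
    for path in dirwalker(root):
        if path.endswith(suffix):
            # Returning here abandons the generator, so directories that
            # os.walk has not reached yet are never scanned.
            return path
    return None


def first_n(root, n):
    """Hypothetical helper: at most n paths, pulled lazily from dirwalker."""
    return list(islice(dirwalker(root), n))
```

A list-returning version of dirwalker would give the same answers, but only after walking the entire tree first.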
A pattern I often encounter while gathering files is to exclude or include those that match a set of patterns. In the next post, I will introduce a new feature to dirwalker to handle it.
Gathering files using os.walk is not that hard, but it has its annoyances. That’s the reason I wrote dirwalker. I believe dirwalker can make your code simpler and more Pythonic. Give it a try.