The Problem
In Python, I often need to traverse a directory recursively and act on each file in some way. The usual solution is os.walk, but it has three problems:
- It yields a tuple of three elements, and I often forget their order and have to look it up
- It does not give the full path to the file; I always have to call os.path.join to construct it
- It gives a list of file names for each directory, which requires a second loop nested inside the first
Here is an example:
import os

for dirpath, dirnames, filenames in os.walk(root):
    for filename in filenames:
        fullpath = os.path.join(dirpath, filename)
        # do something with fullpath
The Solution
What I really want is a simple function that takes a directory and yields the path of every file beneath it, so that a single loop is enough:
for fullpath in dirwalker(root):
    # do something with fullpath
    ...
Implementing the dirwalker function is not that hard:
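import os

def dirwalker(root):
    """Yield the path of every file under root, one at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            # dirpath already starts with root, so this is the full path
            fullpath = os.path.join(dirpath, filename)
            yield fullpath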
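To see what the generator produces, here is a minimal sketch that builds a small throwaway tree and walks it. The temporary directory, the file names (a.txt and sub/b.txt), and the found variable are all made up for illustration; only dirwalker itself comes from the code above.

```python
import os
import tempfile


def dirwalker(root):
    """Yield the path of every file under root, one at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            yield os.path.join(dirpath, filename)


# Build a throwaway tree: root/a.txt and root/sub/b.txt (hypothetical names).
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for relative in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, relative), "w") as handle:
            handle.write("demo")

    # Sort the results, because the order of file names is not guaranteed.
    found = sorted(dirwalker(root))
    for fullpath in found:
        print(fullpath)
```

Every printed path begins with the temporary root, and the file inside sub shows up without any extra loop in the calling code.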
Discussion
The dirwalker function is just a thin wrapper around os.walk, but it solves the three stated problems. First, it yields plain path names instead of three-element tuples, so there is no order to remember. Second, each path is already joined with its directory, starting from the root, so the caller does not have to build it. This is the form I need most often. Finally, it eliminates the nested loop in the calling code, which simplifies the code and improves readability.
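Each path that dirwalker yields starts with the root that was passed in. If the caller wants the root prefix stripped instead, os.path.relpath can do that. The sketch below is one possible way to do it, not part of the original function; the helper name relative_files and the sample file sub/b.txt are hypothetical.

```python
import os
import tempfile


def dirwalker(root):
    """Yield the path of every file under root, one at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            yield os.path.join(dirpath, filename)


def relative_files(root):
    """Hypothetical helper: yield each file path with the root prefix removed."""
    for fullpath in dirwalker(root):
        yield os.path.relpath(fullpath, root)


with tempfile.TemporaryDirectory() as root:
    # Create root/sub/b.txt (a made-up file for the demonstration).
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "b.txt"), "w") as handle:
        handle.write("demo")

    relative = list(relative_files(root))
    print(relative)
```

Keeping the root in the yielded path and stripping it only on request means the default output can be opened directly, while the relative form stays one call away.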
I made dirwalker a generator instead of a normal function that returns a list, for three reasons. First, a generator delivers results sooner: it yields each path name as soon as it has constructed it, so the caller does not have to wait for dirwalker to finish traversing all the sub-directories before receiving the first one. Second, dirwalker does not need to store all the path names in a list before returning to the caller, which saves memory. Finally, the calling code sometimes wants to break out of the loop based on some condition. A function that returns a list has to traverse every directory anyway, even if the caller decides to stop early. Since a generator only produces output on demand, it does not have this problem.
A common pattern I encounter while gathering files is to include or exclude those that match a set of patterns. In the next post, I will introduce a new feature to dirwalker: filtering.
Conclusion
Gathering files using os.walk is not that hard, but it has its annoyances. That’s the reason I wrote dirwalker. I believe dirwalker can make your code simpler and more Pythonic. Give it a try.