- Python unittest longMessage
- Use assert outside of unittest.TestCase class
csv module is excellent for parsing comma-separated-values (CSV) files. There are times when we need a CSV file in which we can add a couple of comment lines, but the
csv module does not handle those lines. Continue reading
For my first trip to New Orleans, I spent most of my time in the hospital looking after my mother-in-law. Just hours before getting on the plane back to Seattle, I decided to try the poboy and this place is the nearest to the hospital, so why not?
In my previous post, I presented a simple directory walker which solved some of my annoyances. That directory walker is not not perfect. There are times when I want to filter out the files:
for path_name in dirwalker('/path/to/dir'): if some_condition(path_name): pass # Do something
The Use Cases
In this case, I want to process the files only if some condition is true. I would be nice if we can tell
dirwalker to return only the files that match our condition:
from dirwalker import dirwalker, include, exclude # Only process *.xml files for path_name in dirwalker('.', include('*.xml')): print path_name # Process all but *.obj, *.bak for path_name in dirwalker('.', exclude('*.obj', '*.bak')): print path_name # Create my own predicate: process only empty files import os def is_empty(path_name): stat = os.stat(path_name) return stat.st_size == 0 for path_name is dirwalker('.', is_empty): print path_name
The implementation of the new
from fnmatch import fnmatch import os def exclude(*patterns): """A predicate which excludes any file that matches a pattern """ def predicate(filename): return not any(fnmatch(filename, pattern) for pattern in patterns) return predicate def include(*patterns): """ A predicate which includes only files that match a list of patterns """ def predicate(filename): return any(fnmatch(filename, pattern) for pattern in patterns) return predicate def dirwalker(root, predicate=None): """ Recursively walk a directory and yield the path names """ for dirpath, dirnames, filenames in os.walk(root): for filename in filenames: fullpath = os.path.join(dirpath, filename) if predicate is None or predicate(filename): yield fullpath
dirwalker takes in an additional parameter: a predicate which returns
True for those files we want to process and
False otherwise. To maintain backward compatibility, the predicate is default to
None which means
dirwalker will yield every file it found.
I also created two predicates creators,
exclude, which create appropriate predicates. As you can see in the usage, it is easy to create a custom predicate if the built-in ones do not work for your purposes. Here are a few suggestions for predicates:
- Files that are read-only
- Files that are larger than a certain threshold
- Files that have been modified within a time frame
- Files that are symbolic links
- Black lists and white lists
dirwalker is now more powerful, thanks to the added functionality. At the same time, it is still simple to use.
In Python, I often need to traverse a directory recursively and act on the files in some way. The solution is to use
os.walk, but this method has three problems:
- It returns a tuple of three elements and I often don’t remember the order, which requires to look it up
- It does not return the full path to the file. I always have to call
os.jointo construct the full path
- It returns a list of file names, which requires another loop. That means a nested loop
Here is an example:
for dirpath, dirnames, filenames in os.walk(root): for filename in filenames: fullpath = os.path.join(dirpath, filename) # do something with fullpath
What I really want is a simple function which takes a directory and return a list of file names relative to that directory:
for fullpath in dirwalker(root): # do something with fullpath
dirwalker function is not that hard:
def dirwalker(root): for dirpath, dirnames, filenames in os.walk(root): for filename in filenames: fullpath = os.path.join(dirpath, filename) yield fullpath
dirwalker function is just a shell on top of
os.walk, but it solves the three stated problems. First, it generates a list of path names instead of a tuple. This makes it easier to remember. Second, it returns the path, relative to the root. This is more useful for my usage. Finally, it eliminates the need for nested loops, greatly simplify the coding experience and at the same time improve readability.
dirwalker a generator instead of a normal function for a couple of reasons. First, a generator is faster because it “returns” a path name as soon as it constructed one. The caller does not have to wait for
dirwalker to finish traversing all the sub-directories before receiving the path names. Secondly,
dirwalker does not need to store all the path names in a list before returning to the caller, saving memory. Finally, the caller code sometimes want to break out of the loop based on some condition; A normal function will have to traverse all of the directories anyway—even if the caller decide to break out early. Since a generator only generate output on demand, it does not have this problem.
A common pattern I often encounter while gathering files is to exclude or include those that match a set of patterns. In the next post, I will introduce a new feature to
Gathering files using
os.walk is not that hard, but it has its annoyances. That’s the reason I wrote
dirwalker. I believe
dirwalker can make your code simpler and more Pythonic. Give it a try.
Here is a problem: given a sequence and an item in that sequence, find the item which follows.
For those who code in C-like languages, the first attempt might looks like this:
def find_item_after(sequence, item): items_count = len(sequence) for index in range(items_count): if sequence[index] == item and index < items_count - 1: return sequence[index] return None
The problem with this solution is the ugliness of the code, and the inefficiency of indexing.
More Pythonic Solution
def find_item_after(sequence, item): for item_here, item_after in zip(sequence, sequence[1:]): if item_here == item: return item_after return None
This solution improves in readability, but at the expense of performance. First, the expression
sequence[1:] creates a new sequence from the current one. Then, the
zip function creates yet another sequence that are twice the size of the original requence. Note that in Python 3, zip will return a generator object instead of a sequence, which will help in term of efficiency.
Pythonic and Efficient
Until we use Python 3, we are better off using a generator version of
izip. We can also avoid the memory hogging aspect of
from itertools import izip, islice def find_item_after(sequence, item): for item_here, item_after in izip(sequence, islice(sequence, 1, None)): if item_here == item: return item_after return None
This solution is both Pythonic and efficient. However, we can still do better.
def find_item_after(sequence, item): iterable = iter(sequence) for item_here in iterable: if item_here == item: return next(iterable, None) return None
This solution does not use any external library, nor does it create extra copy of the sequence. The
next function takes a second parameter, the return value in case that we are at the end of the iterable, when
next will generate a
What about Find Before?
We can adapt the solution which use
iter for this purpose:
def find_item_before(sequence, item): iterable = iter(sequence) item_before = None for item_here in iterable: if item_here == item: return item_before item_before = item_here return None
However, the above is too clunky and not efficient: we have to assign a new value to
item_before every time we go through the loop. What about the solution which uses
from itertools import izip, islice def find_item_after(sequence, item): for item_before, item_here in izip(islice(sequence, 1, None), sequence): if item_here == item: return item_before return None
Note that in this solution, we start searching in the sequence starting with the second item (at index 1) and eliminate the first item from the search. If the caller wants an item before the first item, the loop will complete without any result and we return
This problem was originally an interview question, but it does have some practical application
When coding in Python, I often see this pattern:
filename = [f for f in file_list if ....]
The above expression traverses through the whole list to find items that fit the criteria (the if phrase), then return the first item.
This construction is inefficient because it did not stop even after it found the first item. The effect is more obvious if one or more of these conditions happen:
- The list is long
- The criteria is time consuming
- This construction happens several times
A better, more efficient approach is to stop searching once you found your candidate:
filename = next(f for f in file_list if ...)
In this approach, the expression inside the
next function is a generator expression which returns a generator object. The next function returns the first item that fits the criteria. The net effect is the code does not traverse the entire list. Instead, it stops right after locating the first item.
Note that both approaches do not guard against such cases as empty list for
file_list, or when the search came up empty.
Likewise, to find the last item, instead of:
filename = [f for f in file_list if ....][-1]
filename = next(f for f in reversed(file_list) if ...)
reversed function requires its argument to be a sequence (list, tuple, …) and reverse itself returns a generator object. In this situation, we have nested generators, but the effect is the same: the code will stop once it found the last item that fits the criteria and not traversing the entire list.