A Simple Directory Walker with Filter

The Problem

In my previous post, I presented a simple directory walker which solved some of my annoyances. That directory walker is not not perfect. There are times when I want to filter out the files:

for path_name in dirwalker('/path/to/dir'):
   if some_condition(path_name):
        pass  # Do something 

The Use Cases

In this case, I want to process the files only if some condition is true. I would be nice if we can tell dirwalker to return only the files that match our condition:

from dirwalker import dirwalker, include, exclude

# Only process *.xml files
for path_name in dirwalker('.', include('*.xml')):
   print path_name

# Process all but *.obj, *.bak
for path_name in dirwalker('.', exclude('*.obj', '*.bak')):
   print path_name

# Create my own predicate: process only empty files
import os
def is_empty(path_name):
   stat = os.stat(path_name)
   return stat.st_size == 0
for path_name is dirwalker('.', is_empty):
   print path_name

The Solution

The implementation of the new dirwalker is:

from fnmatch import fnmatch
import os

def exclude(*patterns):
   """A predicate which excludes any file that matches a pattern """
   def predicate(filename):
       return not any(fnmatch(filename, pattern) for pattern in patterns)
   return predicate

def include(*patterns):
   """ A predicate which includes only files that match a list of patterns """
   def predicate(filename):
       return any(fnmatch(filename, pattern) for pattern in patterns)
   return predicate

def dirwalker(root, predicate=None):
   """ Recursively walk a directory and yield the path names """
   for dirpath, dirnames, filenames in os.walk(root):
       for filename in filenames:
           fullpath = os.path.join(dirpath, filename)
           if predicate is None or predicate(filename):
               yield fullpath

Discussion

The new dirwalker takes in an additional parameter: a predicate which returns True for those files we want to process and False otherwise. To maintain backward compatibility, the predicate is default to None which means dirwalker will yield every file it found.

I also created two predicates creators, include and exclude, which create appropriate predicates. As you can see in the usage, it is easy to create a custom predicate if the built-in ones do not work for your purposes. Here are a few suggestions for predicates:

  • Files that are read-only
  • Files that are larger than a certain threshold
  • Files that have been modified within a time frame
  • Files that are symbolic links
  • Black lists and white lists

Conclusion

The dirwalker is now more powerful, thanks to the added functionality. At the same time, it is still simple to use.

A Simple Directory Walker

The Problem

In Python, I often need to traverse a directory recursively and act on the files in some way. The solution is to use os.walk, but this method has three problems:

  1. It returns a tuple of three elements and I often don’t remember the order, which requires to look it up
  2. It does not return the full path to the file. I always have to call os.join to construct the full path
  3. It returns a list of file names, which requires another loop. That means a nested loop

Here is an example:

for dirpath, dirnames, filenames in os.walk(root):
   for filename in filenames:
       fullpath = os.path.join(dirpath, filename)
       # do something with fullpath

The Solution

What I really want is a simple function which takes a directory and return a list of file names relative to that directory:

for fullpath in dirwalker(root):
   # do something with fullpath

Implementing the dirwalker function is not that hard:

def dirwalker(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            fullpath = os.path.join(dirpath, filename)
            yield fullpath

Discussion

The dirwalker function is just a shell on top of os.walk, but it solves the three stated problems. First, it generates a list of path names instead of a tuple. This makes it easier to remember. Second, it returns the path, relative to the root. This is more useful for my usage. Finally, it eliminates the need for nested loops, greatly simplify the coding experience and at the same time improve readability.

I made dirwalker a generator instead of a normal function for a couple of reasons. First, a generator is faster because it “returns” a path name as soon as it constructed one. The caller does not have to wait for dirwalker to finish traversing all the sub-directories before receiving the path names. Secondly, dirwalker does not need to store all the path names in a list before returning to the caller, saving memory. Finally, the caller code sometimes want to break out of the loop based on some condition; A normal function will have to traverse all of the directories anyway—even if the caller decide to break out early. Since a generator only generate output on demand, it does not have this problem.

A common pattern I often encounter while gathering files is to exclude or include those that match a set of patterns. In the next post, I will introduce a new feature to dirwalker: filtering.

Conclusion

Gathering files using os.walk is not that hard, but it has its annoyances. That’s the reason I wrote dirwalker. I believe dirwalker can make your code simpler and more Pythonic. Give it a try.

Find Item After a Specific Item

Here is a problem: given a sequence and an item in that sequence, find the item which follows.

Initial Solution

For those who code in C-like languages, the first attempt might looks like this:

def find_item_after(sequence, item):
    items_count = len(sequence)
    for index in range(items_count):
        if sequence[index] == item and index < items_count - 1:
            return sequence[index]
    return None

The problem with this solution is the ugliness of the code, and the inefficiency of indexing.

More Pythonic Solution

def find_item_after(sequence, item):
    for item_here, item_after in zip(sequence, sequence[1:]):
        if item_here == item:
            return item_after
    return None

This solution improves in readability, but at the expense of performance. First, the expression sequence[1:] creates a new sequence from the current one. Then, the zip function creates yet another sequence that are twice the size of the original requence. Note that in Python 3, zip will return a generator object instead of a sequence, which will help in term of efficiency.

Pythonic and Efficient

Until we use Python 3, we are better off using a generator version of zip, namely izip. We can also avoid the memory hogging aspect of sequence[1:] using islice:

from itertools import izip, islice
def find_item_after(sequence, item):
    for item_here, item_after in izip(sequence, islice(sequence, 1, None)):
        if item_here == item:
            return item_after
    return None

This solution is both Pythonic and efficient. However, we can still do better.

Using iter

def find_item_after(sequence, item):
    iterable = iter(sequence)
    for item_here in iterable:
        if item_here == item:
            return next(iterable, None)
    return None

This solution does not use any external library, nor does it create extra copy of the sequence. The next function takes a second parameter, the return value in case that we are at the end of the iterable, when next will generate a StopIteration exception.

What about Find Before?

We can adapt the solution which use iter for this purpose:

def find_item_before(sequence, item):
    iterable = iter(sequence)
    item_before = None
    for item_here in iterable:
        if item_here == item:
            return item_before
        item_before = item_here
    return None

However, the above is too clunky and not efficient: we have to assign a new value to item_before every time we go through the loop. What about the solution which uses izip and islice?

from itertools import izip, islice
def find_item_after(sequence, item):
    for item_before, item_here in izip(islice(sequence, 1, None), sequence):
        if item_here == item:
            return item_before
    return None

Note that in this solution, we start searching in the sequence starting with the second item (at index 1) and eliminate the first item from the search. If the caller wants an item before the first item, the loop will complete without any result and we return None.

Conclusion

This problem was originally an interview question, but it does have some practical application

Find First Pattern in Python

When coding in Python, I often see this pattern:

filename = [f for f in file_list if ....][0]

The above expression traverses through the whole list to find items that fit the criteria (the if phrase), then return the first item.

This construction is inefficient because it did not stop even after it found the first item. The effect is more obvious if one or more of these conditions happen:

  • The list is long
  • The criteria is time consuming
  • This construction happens several times

A better, more efficient approach is to stop searching once you found your candidate:

filename = next(f for f in file_list if ...)

In this approach, the expression inside the next function is a generator expression which returns a generator object. The next function returns the first item that fits the criteria. The net effect is the code does not traverse the entire list. Instead, it stops right after locating the first item.

Note that both approaches do not guard against such cases as empty list for file_list, or when the search came up empty.

Likewise, to find the last item, instead of:

filename = [f for f in file_list if ....][-1]

Do this:

filename = next(f for f in reversed(file_list) if ...)

The reversed function requires its argument to be a sequence (list, tuple, …) and reverse itself returns a generator object. In this situation, we have nested generators, but the effect is the same: the code will stop once it found the last item that fits the criteria and not traversing the entire list.

Eliminate If from Your Code

Introduction

We often write code with too many if statements. If statements are fine, but in many cases, we can eliminate them, making the code simpler and faster.

Example 1: The "or" Trick

In this example, we create a linked list data structure:

class List(object):
    def __init__(self, iterable=None):
        self.head = None
        self.tail = None
        self.length = 0
        if iterable:
            for item in iterable:
                self.append(item)

In this example, the linked list's __init__ method takes in an optional iterable object (an object that we can iterate through). If the iterable is not None, we then iterate through it and append them one by one. At the first glance, the if statement does have its place: it prevents the for loop from iterating over a None object, which causes runtime error.

The key to remove the if statement lies in the fact that we cannot iterate through the None object, but an empty list:

class List(object):
    def __init__(self, iterable=None):
        self.head = None
        self.tail = None
        self.length = 0
        for item in iterable or []:
            self.append(item)

The expression iterable or [] will return the iterable if it is "true", or an empty list if not. When iterable is None, we will iterate through an empty list, which means the body of the for loop will execute zero times.

Example 2: Toggle Values

Consider the following pattern:

if flag:
    flag = False
else:
    flag = True

This block of code toggles the value of flag between True and False. A simpler and more efficient way is:

flag = not flag

Example 3: Boolean Equality Test

Believe it or not, I see the following a few times in my career:

if expected == True and actual == True:
    return True
elif expected == False and actual == False:
    return True
else return False

Or:

if (expected == True and actual == True) or \
   (expected == False and actual == False):
    return True
else return False

The two blocks above say:

If the values of actual and expected are both True or are both False then we return True, otherwise we return False.

In other word, these blocks test to see if actual is the same as expected:

return actual == expected

Example 4: Nested If

I also frequently see code like this:

if wrap_around:
    if x == MAX_WIDTH:
        x = 0
        y += 1

Depend on the situation, we might not be able to eliminate the if statements altogether, but we can reduce the nesting:

if wrap_around and x == MAX_WIDTH:
    x = 0
    y += 1

Example 5: Cycle of Values

In the following example, the coder wanted value of column to cycle between 0 and 7:

column = 0
for cell in data:
    # do something
    column += 1
    if column == 8:
        column = 0

The quick and easy way to improve this code is to employ the modulus operator:

column = 0
for cell in data:
    # do something
    column = (column + 1) % 8

Another fix is to use the cycle function:

from itertools import izip, cycle

for column, cell in izip(cycle(range(8)), data):
    # Do something

In this case, range(8) returns [0, 1, 2, 3, 4, 5, 6, 7] and range will keep cycling values in that list forever. The izip function then pairs that 0-7 cycle with the data.

Conclusion

These are just a couple of instances where we can eliminate the if statements and make the code shorter, faster, and simpler to understand.

Simple Online IDE for Python

I am in need of a quick way to post Python code snippet online, run it and show output. My research points me to quite a few and I like three of them.

codepad

Site: codepad.org

What I like

  • No ad!
  • Does not require signing up to use
  • The code and the output are nicely formatted and easy to view
  • Once submitted, codepad generates a new page with a unique URL. The users cannot change the code unless they choose to fork. This is great to present someone a code snippet without fear of someone modifying it
  • Has a comment section, good for discussion

Improvements Wish List

  • Syntax highlight
  • Code completion
  • Optional title, description, and tags

Python Fiddle

Site: http://pythonfiddle.com

What I like

  • No ad!
  • Does not require signing up to use
  • Syntax highlight
  • Code completion
  • Large editing area

Improvements Wish List

  • Buggy sometimes. I typed import json and got invalid syntax on print import json
  • Importing of modules takes a long time: the first time I import json, it took about 2 seconds
  • Output not clearly labeled

ideone

Site: https://ideone.com

What I like

  • Does not require signing up to use
  • Syntax highlight
  • Many languages
  • Ability to specify stdin

Improvements Wish List

  • Larger editing area
  • Required Flash
  • Lots of ads
  • Output is burried below an ad

Why Returning a Mutable Attribute Might Be Dangerous

Avoid Returning Mutable Attribute

People who come from C++, Java or C# often provide getter/setter to provide access to their classes’ internals. In Python, there is not point to use getter/setter unless there is a good reason for doing so. I have been coding in Python for several years and have not yet come by one such reason.

I have seen code where people prefix a class instance’s attribute with underscore thinking it should make that attribute private. The following example will demonstrate that point:

class Car(object):
    def __init__(self, owners):
        self._owners = owners

    def get_owners(self):
        return self._owners


if __name__ == '__main__':
    car = Car(['Peter', 'Paul'])
    print 'This car belongs to', car.get_owners()

    owners = car.get_owners()
    owners.append('Mary')  
    print 'This car belongs to', car.get_owners()

Output:

This car belongs to ['Peter', 'Paul']
This car belongs to ['Peter', 'Paul', 'Mary']

In the example above, the user calls car.getowners() to gain access to a private member, but then went on modifying it. The immediate solution is to return a copy of self.owners. That way, changes made to the copy does not affect the original member.

Avoid Returning Attributes at All

Instead of allowing the users to gain access to your class’ internals, why not provide what the users really want. Think of a class attributes as nouns (things) and methods as verbs (actions). A class should provide the users with actions, not nouns.

Here is a typical example:

import fnmatch

class FileFilter(object):
    def __init__(self, include=None):
        self._include = include

    @property
    def include_patterns(self):
        return self._include[:]

# Usage example: prints all the files that fit a set of patterns
filter = FileFilter(include=['*.py', '*.ipynb'])
for filename in ['readme.txt', 'hello.py', 'greeting.ipynb']:
    for pattern in filter.include_patterns:
        if fnmatch.fnmatch(filename, pattern):
            print filename
            break

Output:

hello.py
greeting.ipynb

Provide Services Instead of Attributes

In the example above, the class FileFilter provides a getter, include which returns a list of pattern. Instead of providing the users with an attribute (self.include, a noun), it should instead provide a service, an action, namely: shouldinclude:

import fnmatch

class FileFilter(object):
    def __init__(self, include=None):
        self._include = include

    def should_include(self, filename):
        return any(fnmatch.fnmatch(filename, pattern) for pattern in self._include)

# Usage example: prints all the files that fit a set of patterns
filter = FileFilter(include=['*.py', '*.ipynb'])
for filename in ['readme.txt', 'hello.py', 'greeting.ipynb']:
    if filter.should_include(filename):
        print filename

Output:

hello.py
greeting.ipynb

Notice in the example above, FileFilter does not expose any internal data. Instead, it offers a service (an action, or verb, namely should_include). Also notice the code which uses the FileFilter class is also simpler, with one less nested level.

Conclusion

When designing a class, you should avoid returning a class’ internal data, especially mutable structures. Instead, you need to study the users code to see what they use those data for and offer services to accomplish their goals.