Tag Archives: python

A Simple Directory Walker with Filter

The Problem

In my previous post, I presented a simple directory walker which solved some of my annoyances. That directory walker is not perfect. There are times when I want to filter the files it returns:

for path_name in dirwalker('/path/to/dir'):
   if some_condition(path_name):
       pass  # Do something

The Use Cases

In this case, I want to process the files only if some condition is true. It would be nice if we could tell dirwalker to return only the files that match our condition:

from dirwalker import dirwalker, include, exclude

# Only process *.xml files
for path_name in dirwalker('.', include('*.xml')):
   print path_name

# Process all but *.obj, *.bak
for path_name in dirwalker('.', exclude('*.obj', '*.bak')):
   print path_name

# Create my own predicate: process only empty files
import os
def is_empty(path_name):
   stat = os.stat(path_name)
   return stat.st_size == 0
for path_name in dirwalker('.', is_empty):
   print path_name

The Solution

The implementation of the new dirwalker is:

from fnmatch import fnmatch
import os

def exclude(*patterns):
   """A predicate which excludes any file that matches a pattern """
   def predicate(filename):
       return not any(fnmatch(filename, pattern) for pattern in patterns)
   return predicate

def include(*patterns):
   """ A predicate which includes only files that match a list of patterns """
   def predicate(filename):
       return any(fnmatch(filename, pattern) for pattern in patterns)
   return predicate

def dirwalker(root, predicate=None):
   """ Recursively walk a directory and yield the path names """
   for dirpath, dirnames, filenames in os.walk(root):
       for filename in filenames:
           fullpath = os.path.join(dirpath, filename)
           # Pass the full path so that predicates such as is_empty can stat the file
           if predicate is None or predicate(fullpath):
               yield fullpath

Discussion

The new dirwalker takes an additional parameter: a predicate which returns True for the files we want to process and False otherwise. To maintain backward compatibility, the predicate defaults to None, which means dirwalker will yield every file it finds.

I also created two predicate creators, include and exclude, which build the appropriate predicates. As you can see in the usage, it is easy to write a custom predicate if the built-in ones do not suit your purposes. Here are a few suggestions for predicates:

  • Files that are read-only
  • Files that are larger than a certain threshold
  • Files that have been modified within a time frame
  • Files that are symbolic links
  • Black lists and white lists
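
A few of these suggestions could be sketched as follows. This is only a sketch under some assumptions: the names larger_than, modified_within, is_read_only, is_symlink, and all_of are hypothetical and not part of the dirwalker module, and each predicate assumes it is handed a path that os.stat can resolve (the full path, or a name relative to the current directory).

```python
import os
import stat
import time

def larger_than(threshold):
    """Hypothetical creator: accept files bigger than threshold bytes."""
    def predicate(path_name):
        return os.path.getsize(path_name) > threshold
    return predicate

def modified_within(seconds):
    """Hypothetical creator: accept files modified in the last `seconds`."""
    def predicate(path_name):
        return time.time() - os.path.getmtime(path_name) <= seconds
    return predicate

def is_read_only(path_name):
    """Accept files whose mode has no write bit set for anyone."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    return not (os.stat(path_name).st_mode & write_bits)

def is_symlink(path_name):
    """Accept symbolic links."""
    return os.path.islink(path_name)

def all_of(*predicates):
    """Hypothetical combinator: accept a file only if every predicate does."""
    def predicate(path_name):
        return all(p(path_name) for p in predicates)
    return predicate
```

For example, all_of(larger_than(1024), modified_within(3600)) would accept files over 1 KB that changed within the last hour. A black list and a white list could be combined the same way, by passing an include predicate and an exclude predicate to all_of.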

Conclusion

The dirwalker is now more powerful thanks to the optional predicate, yet it is still simple to use.

Simple Online IDE for Python

I need a quick way to post Python code snippets online, run them, and show the output. My research turned up quite a few sites, and I like three of them.

codepad

Site: codepad.org

What I like

  • No ads!
  • Does not require signing up to use
  • The code and the output are nicely formatted and easy to view
  • Once submitted, codepad generates a new page with a unique URL. Users cannot change the code unless they choose to fork it, which is great for showing someone a snippet without fear of it being modified
  • Has a comment section, good for discussion

Improvements Wish List

  • Syntax highlighting
  • Code completion
  • Optional title, description, and tags

Python Fiddle

Site: http://pythonfiddle.com

What I like

  • No ads!
  • Does not require signing up to use
  • Syntax highlighting
  • Code completion
  • Large editing area

Improvements Wish List

  • Sometimes buggy: I typed import json and got an invalid syntax error on print import json
  • Importing modules takes a long time: the first time I imported json, it took about 2 seconds
  • Output is not clearly labeled

ideone

Site: https://ideone.com

What I like

  • Does not require signing up to use
  • Syntax highlighting
  • Many languages
  • Ability to specify stdin

Improvements Wish List

  • Larger editing area
  • Requires Flash
  • Lots of ads
  • Output is buried below an ad

Python: Making Complex Regular Expression Easier to Read

In my last post, I shared a way to create regular expressions with embedded comments for the Tcl scripting language. It turns out that Python offers a similar feature.

The Problem

I often need to deal with complex regular expressions while scripting in Python. The problem is that the expression syntax is terse, cryptic, and hard to understand and debug. There must be a better way to deal with regular expressions; a way to add comments would be nice.

The Solution

As with my last post, I will use the same example: fishing out email addresses from a chunk of text. Below is the Python counterpart of my previous solution:

import re

if __name__ == '__main__':
    test_data = '''
            This is a bunch of text
            within it, there are some emails such as foo@bar.com
            or one@two.three.net
            What about mixed case: John.Doe@services.company.ws...
            Let see if we can extract them out
            '''
    email_pattern = r'''
            # The part before the @
            [a-z0-9._%-]+

            # The at sign itself
            @

            # The domain, not including the last dot
            [a-z0-9.-]+

            # The last dot
            \.

            # The top-level domain (TLD), which ranges from 
            # 2 to 4 characters
            [a-z]{2,4}
            '''
    print 'START'
    result = re.findall(email_pattern, 
            test_data, 
            re.IGNORECASE|re.VERBOSE)
    print '\n'.join(result)
    print 'END'

The output:

START
foo@bar.com
one@two.three.net
John.Doe@services.company.ws
END

Conclusion

With the re.VERBOSE flag, I can embed whitespace and comments in the regular expression, making it easier to read and understand.
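
One caveat is worth a small sketch: because re.VERBOSE ignores whitespace in the pattern and treats # as the start of a comment, a literal space or # has to be escaped or placed inside a character class. The patterns below (naive, spaced, hashtag) are hypothetical examples made up for illustration; they are not part of the email example.

```python
import re

# Hypothetical pattern for "<number> <unit>", such as "12 kg".
# In verbose mode the space between the two groups is ignored,
# so this pattern matches "12kg" but not "12 kg".
naive = re.compile(r'''
    (\d+) ([a-z]+)     # the space between the groups is ignored
    ''', re.VERBOSE)

# Writing the space as [ ] keeps it literal.
spaced = re.compile(r'''
    (\d+)       # the number
    [ ]         # a literal space
    ([a-z]+)    # the unit
    ''', re.VERBOSE)

# Likewise, [#] matches a literal hash sign instead of starting a comment.
hashtag = re.compile(r'''
    [#]         # a literal hash sign
    (\w+)       # the tag name
    ''', re.VERBOSE)

print(naive.search('weight: 12 kg'))            # None
print(spaced.search('weight: 12 kg').groups())  # ('12', 'kg')
print(hashtag.findall('see #python and #regex'))  # ['python', 'regex']
```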

Some Python One-Liners

Python is such a simple, get-the-job-done language that many of the scripts I write are very short. Here are a few one- and two-liners:

# oneliners.py by Hai Vu
# Here are some of my one- and two-liners
import os, sys

# To create Pascal naming from a regular phrase
print 'print in order traversal'.title().replace(' ', '') # --> PrintInOrderTraversal

# Print the contents of a file:
print open('oneliners.py').read()

# Read the contents of a file into a list of lines (without the newlines):
lines = [x.rstrip('\n') for x in open('oneliners.py')]

# Poor man's calculator: parse the command line and evaluate it.
# If you save this to a file called eval.py, then call it as follows:
#   python eval.py "28 * 1.15" 
# The output will be:
#   28 * 1.15  = 32.2
expression = ' '.join(sys.argv[1:])
print expression, ' =', eval(expression)

# Print the path with each component on a separate line;
# also works for such environment variables as INCLUDE, LIB, and CDPATH
print '\n'.join(os.environ['PATH'].split(os.pathsep))

Please submit your favorite one-liners in the comments section.