Python: Making Complex Regular Expression Easier to Read

In my last post, I shared a way to create regular expressions with embedded comments in the Tcl scripting language. It turns out that Python offers a similar feature.

The Problem

I often need to deal with complex regular expressions while scripting in Python. The problem is that the expression syntax is terse, cryptic, and hard to understand and debug. There must be a better way to deal with regular expressions; a way to add comments would be nice.

The Solution

As with my last post, I will use the same example: fishing out email addresses from a chunk of text. Below is the Python counterpart of my previous solution:

import re

if __name__ == '__main__':
    test_data = '''
            This is a bunch of text
            within it, there are some emails such as foo@bar.com
            or one@two.three.net
            What about mixed case: John.Doe@services.company.ws...
            Let see if we can extract them out
            '''
    email_pattern = r'''
            # The part before the @
            [a-z0-9._%-]+

            # The at sign itself
            @

            # The domain, not including the last dot
            [a-z0-9.-]+

            # The last dot
            \.

            # The top-level domain (TLD), which ranges from 
            # 2 to 4 characters
            [a-z]{2,4}
            '''
    print('START')
    result = re.findall(email_pattern,
                        test_data,
                        re.IGNORECASE | re.VERBOSE)
    print('\n'.join(result))
    print('END')

The output:

START
foo@bar.com
one@two.three.net
John.Doe@services.company.ws
END

Conclusion

With the re.VERBOSE flag, I can embed whitespace and comments in a regular expression, making it easier to read and understand.
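
One caveat worth keeping in mind: under re.VERBOSE, whitespace in the pattern is ignored and a # starts a comment, unless they sit inside a character class or are escaped with a backslash. Below is a minimal sketch of matching a literal space and a literal #. The issue_pattern name and the sample text are made up for illustration and are not part of the email example.

```python
import re

# Hypothetical pattern: an issue reference such as "issue #42"
issue_pattern = re.compile(r'''
    issue   # the literal word
    [ ]     # a literal space, kept because it sits inside a character class
    \#      # a literal hash, escaped so it does not start a comment
    (\d+)   # the issue number, captured
    ''', re.IGNORECASE | re.VERBOSE)

text = 'See issue #42 and Issue #7, but not issue#9.'
numbers = issue_pattern.findall(text)
print(numbers)
```

Compiling the pattern once with re.compile also keeps the flags next to the pattern, which may be convenient when the same expression is reused in several places.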

Tcl: Making Complex Regular Expression Easier to Read

The Problem

I often need to deal with complex regular expressions while scripting in Tcl (or other languages, for that matter). The problem is that the expression syntax is terse, cryptic, and hard to understand and debug. There must be a better way to deal with regular expressions; a way to add comments would be nice.

For example, below is the expression to extract email addresses:

set email_pattern {[a-z0-9._%-]+@[a-z0-9.-]+\.[a-z]{2,4}}

Below is some code that uses this pattern to extract email addresses from a block of text:

set test_data {
	This is a bunch of text
	within it, there are some emails such as foo@bar.com
	or one@two.three.net
	What about mixed case: John.Doe@services.company.ws...
	Let see if we can extract them out
}

set email_pattern {[a-z0-9._%-]+@[a-z0-9.-]+\.[a-z]{2,4}}

puts "START"
set result [regexp -inline -all -nocase $email_pattern $test_data]
puts [join $result "\n"]
puts "END"

The output:

START
foo@bar.com
one@two.three.net
John.Doe@services.company.ws
END

While this code gets the job done, the lack of documentation in the regular expression makes the code hard to debug. Don’t you wish you could include comments to make it easier to read?

The Solutions

My first instinct was to break up the regular expression into parts, then glue them together:

set local_part {[a-z0-9._%-]+}
set domain {[a-z0-9.-]+}
set tld {\.[a-z]{2,4}}
set email_pattern ""
append email_pattern $local_part @ $domain $tld

That’s better: the code is now self-documenting, and I have broken the long expression into manageable pieces. The drawback is that I need several extra variables to accomplish my goal.

After digging into Tcl’s regular expression documentation, I discovered the -expanded flag, which does what I want: it allows me to add whitespace and comments to the regular expression. Now the code becomes:

set test_data {
	This is a bunch of text
	within it, there are some emails such as foo@bar.com
	or one@two.three.net
	What about mixed case: John.Doe@services.company.ws...
	Let see if we can extract them out
}

set email_pattern {
	# The part before the @
	[a-z0-9._%-]+

	# The at sign itself
	@

	# The domain, not including the last dot
	[a-z0-9.-]+

	# The last dot
	\.

	# The top-level domain (TLD), which ranges from 2 to 4 characters
	[a-z]{2,4}
}

puts "START"
set result [regexp -expanded -inline -all -nocase $email_pattern $test_data]
puts [join $result "\n"]
puts "END"

The above code accomplishes the same goal as before. While it is longer, it is better documented and easier to understand and debug. For those who code in Python, a similar feature exists: the re.VERBOSE flag.
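
For readers following along in Python, the two approaches discussed in this post (gluing named pieces together versus writing one commented pattern) can be compared side by side. This is only a sketch: the variable names and the sample string are assumptions chosen for illustration.

```python
import re

# Approach 1: build the pattern from named pieces (hypothetical names)
local_part = r'[a-z0-9._%-]+'
domain = r'[a-z0-9.-]+'
tld = r'\.[a-z]{2,4}'
pieces_pattern = local_part + '@' + domain + tld

# Approach 2: one pattern with embedded comments, read with re.VERBOSE
verbose_pattern = r'''
    [a-z0-9._%-]+   # the part before the @
    @               # the at sign itself
    [a-z0-9.-]+     # the domain, not including the last dot
    \.              # the last dot
    [a-z]{2,4}      # the top-level domain
    '''

sample = 'Write to foo@bar.com or John.Doe@services.company.ws...'
from_pieces = re.findall(pieces_pattern, sample, re.IGNORECASE)
from_verbose = re.findall(verbose_pattern, sample, re.IGNORECASE | re.VERBOSE)

# Both approaches should find the same addresses
print(from_pieces == from_verbose)
```

The commented pattern keeps everything in one place, while the piece-by-piece version can be handy when the same fragment (the domain, say) is reused in other expressions.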