Tcl: Making Complex Regular Expressions Easier to Read

The Problem

I often need to deal with complex regular expressions while scripting in Tcl (or other languages, for that matter). The problem is that regular expression syntax is terse, cryptic, and hard to understand and debug. There must be a better way to work with regular expressions; a way to add comments would be nice.

For example, below is a simple expression to extract email addresses:

set email_pattern {[a-z0-9._%-]+@[a-z0-9.-]+\.[a-z]{2,4}}

Below is some code that uses this pattern to extract email addresses from a block of text:

set test_data {
	This is a bunch of text
	within it, there are some emails such as foo@bar.com
	or one@two.three.net
	What about mixed case: John.Doe@services.company.ws...
	Let's see if we can extract them out
}

set email_pattern {[a-z0-9._%-]+@[a-z0-9.-]+\.[a-z]{2,4}}

puts "START"
set result [regexp -inline -all -nocase $email_pattern $test_data]
puts [join $result "\n"]
puts "END"

The output:

START
foo@bar.com
one@two.three.net
John.Doe@services.company.ws
END

While this code gets the job done, the lack of documentation for the regular expression makes the code hard to debug. Don’t you wish you could include comments to make it easier to read?

The Solutions

My first instinct was to break up the regular expression into parts, then glue them together:

set local_part {[a-z0-9._%-]+}
set domain {[a-z0-9.-]+}
set tld {\.[a-z]{2,4}}
set email_pattern ""
append email_pattern $local_part @ $domain $tld

That’s better: the code is now self-documenting, and I have broken the long expression into manageable pieces. The drawback is that I need several variables to accomplish my goal.

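After digging into Tcl’s regexp documentation, I discovered the -expanded flag, which does what I want: it allows me to add white space and comments to the regular expression. With this flag, white space and anything from a # to the end of the line are ignored, unless escaped with a backslash or placed inside a bracket expression. Now the code becomes: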

set test_data {
	This is a bunch of text
	within it, there are some emails such as foo@bar.com
	or one@two.three.net
	What about mixed case: John.Doe@services.company.ws...
	Let's see if we can extract them out
}

set email_pattern {
	# The part before the @
	[a-z0-9._%-]+

	# The at sign itself
	@

	# The domain, not including the last dot
	[a-z0-9.-]+

	# The last dot
	\.

	# The top-level domain (TLD), matched here as 2 to 4 letters
	[a-z]{2,4}
}

puts "START"
set result [regexp -expanded -inline -all -nocase $email_pattern $test_data]
puts [join $result "\n"]
puts "END"

The above code accomplishes the same goal as before. While it is longer, it is better documented and easier to understand and debug. For those who code in Python, a similar feature exists: the re.VERBOSE flag.
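
For comparison, here is a minimal sketch of the same extraction in Python using re.VERBOSE. The variable names mirror the Tcl script; re.IGNORECASE stands in for -nocase and findall() for -inline -all, which I am assuming are the closest equivalents rather than exact ones.

```python
import re

test_data = """
    This is a bunch of text
    within it, there are some emails such as foo@bar.com
    or one@two.three.net
    What about mixed case: John.Doe@services.company.ws...
    Let see if we can extract them out
"""

# With re.VERBOSE, white space and "#" comments outside character
# classes are ignored, much like Tcl's -expanded flag.
email_pattern = re.compile(r"""
    [a-z0-9._%-]+   # the part before the @
    @               # the at sign itself
    [a-z0-9.-]+     # the domain, not including the last dot
    \.              # the last dot
    [a-z]{2,4}      # the top-level domain, matched as 2 to 4 letters
""", re.VERBOSE | re.IGNORECASE)

# findall() returns every non-overlapping match as a list of strings
result = email_pattern.findall(test_data)

print("START")
print("\n".join(result))
print("END")
```

As in Tcl, a literal space or # in a verbose pattern has to be escaped with a backslash or placed inside a character class, otherwise it is dropped.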
