Tuesday, September 24, 2013

Learn How to Use regex: 99 Ways Workshop #87

The Software Testing Club recently put out an eBook called "99 Things You Can Do to Become a Better Tester". Some of them are really general and vague. Some of them are remarkably specific.


My goal for the next few weeks is to take the "99 Things" book and see if I can put my own personal spin on each of them, and make a personal workshop out of each of the suggestions.


Suggestion #87: Learn how to use regex.


All right, we're starting to get into direct recommendations that were workshop ideas from before. That means we get to start getting more specific.

I mentioned in an earlier post that shell scripts using regex can help make for dynamic tools to parse information and clean up output for later use, but I didn't say a word about what regex actually is.

For those that know all this already, this may be of limited value. If this is new, this is by no means an exhaustive treatment of regex, but I'd be remiss to not at least offer some nuts & bolts and examples, so that you can see why this might be something you'd want to learn more about.


Therefore, without further ado...

Workshop #87: Determine what tool(s) you would like to use to work with "REGular EXpressions", then practice using them with those tool(s).

I've heard "regex" pronounced a number of different ways. Since we are reading it here, it probably doesn't matter much, and you can decide for yourself what you want to call it. I'm a firm believer in the idea that, if something is an abbreviation of a set of words, then the pronunciation of those words should inform how the abbreviation sounds. Since we are talking about REGular EXpressions, you will always hear me say it as "REG-ex" (hard "g"), not "REJ-ex" (soft "g"). What you choose to call it is totally up to you ;).

Anyway… regex works by letting the user provide a rule (a pattern), and that rule describes the sequence of characters we want to find. If we are looking for an exact word ("apple"), we can just use the word itself as the pattern, and tools like grep, sed, awk, etc. will find literal matches of that word.
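To make the literal case concrete, here's a minimal sketch. The sample lines are made up for illustration, and I'm piping them in rather than reading from a real file.

```shell
# Hypothetical sample data, invented for this example.
sample=$(printf '%s\n' 'apple pie' 'banana bread' 'pineapple juice' 'cherry tart')

# A plain word is already a regex: it matches that exact run of characters
# anywhere on a line, so "pineapple" counts as a match too.
literal_matches=$(printf '%s\n' "$sample" | grep 'apple')
printf '%s\n' "$literal_matches"

# sed can use the same pattern; -n plus the p command prints only matching lines.
sed_matches=$(printf '%s\n' "$sample" | sed -n '/apple/p')
printf '%s\n' "$sed_matches"
```

Both commands print the "apple pie" and "pineapple juice" lines, which shows that a literal pattern matches anywhere in a line, not just whole words.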

That's OK, but more often than not, we want to look for items that will be dynamically assigned through variables or have a variety of ways to be found, not just absolute terms or phrases. What do we do then, if we don't know in advance what the value will be?

regex to the rescue :).
Here are some very quick regex examples, written from the perspective of bash/Linux tools:

- a period (.) matches any single character. '....' would match any four characters

- 'A.' matches an “A” followed by any character

- '.A.' matches any character, then an “A”, then any character

- an asterisk '*' means 'repeat zero or more of the previous character'.

- 'A*' means zero or more “A” characters

- '.*' means zero or more of any character (letters, numbers, symbols, or nothing at all, so it also matches an empty line)

- '..*' means any single character, followed by zero or more of any character. In other words, at least one character, so an empty line isn't an option here.

- '^' means the beginning of a line.

- '$' means the end of a line.

- '^$' would mean any empty line (a line with nothing between its beginning and its end).

- '\' is used as an escape character. This means if you want to look for any of the above examples literally (., *, ^, $), you would use it first, like '\. \* \^ \$'

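- '[abc]' is a bracket expression (often called a character class). In this case, it means "match any single 'a', 'b', or 'c'", so grep would show lines that contain at least one of those characters. Inside the brackets, we can also use '^' as the first character, but here it negates the set: '[^abc]' means "match any single character that is NOT 'a', 'b', or 'c'". Note that this shows lines containing at least one such character, not lines that are free of 'a', 'b', and 'c'. Contiguous ranges can use a shorthand like [A-Z] or [0-9].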

- '\{n,m\}' is the repetition (interval) operator in POSIX basic regular expressions, the default syntax for grep and sed. Something like '[0-9]\{2,3\}' means "match two or three digits in a row". Something like '[0-9]\{3\}' means "match exactly three digits in a row". Keep in mind that grep shows any line containing a match, so a line with a longer run of digits is shown too unless the surrounding pattern rules it out.
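If you'd like to try the items in this list for yourself, here's a small sketch that counts matching lines with grep -c. The sample data and the count_matches helper are my own inventions for illustration, not part of any tool.

```shell
# Hypothetical sample data: ten lines, one of them empty.
sample=$(printf '%s\n' 'A' 'AB' 'xAy' '' 'cab' 'dog' '7' '42' '2013' '3.14')

# Hypothetical helper: print how many sample lines contain a match for the pattern.
count_matches() {
    printf '%s\n' "$sample" | grep -c -e "$1"
}

a_dot=$(count_matches 'A.')                    # 2: "AB" and "xAy"
dot_a_dot=$(count_matches '.A.')               # 1: "xAy"
empty=$(count_matches '^$')                    # 1: the empty line
non_empty=$(count_matches '..*')               # 9: every line except the empty one
literal_dot=$(count_matches '\.')              # 1: "3.14"
not_abc=$(count_matches '[^abc]')              # 8: all but "cab" and the empty line
two_three=$(count_matches '[0-9]\{2,3\}')      # 3: "42", "2013", "3.14"
anchored=$(count_matches '^[0-9]\{2,3\}$')     # 1: only "42"

echo "$a_dot $dot_a_dot $empty $non_empty $literal_dot $not_abc $two_three $anchored"
```

Notice the last two results: without the '^' and '$' anchors, "2013" and "3.14" still count, because each contains a run of two or three digits somewhere on the line.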

OK, with that, let's try something a little more interesting.

$ grep '[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}' datafile

What do you think this might be? As printed, it might be hard to tell, but if we were to look at each element separately, we can probably figure it out.

[0-9] any single numeral between 0 and 9

\{3\} repeated exactly three times

- a literal dash character

\{0,1\} repeated zero or one times (so the dash is optional)

[0-9] any single numeral between 0 and 9

\{2\} repeated exactly two times

- a literal dash character

\{0,1\} repeated zero or one times (so the dash is optional)

[0-9] any single numeral between 0 and 9

\{4\} repeated exactly four times


If you guessed that this is a regex to help find a US Social Security Number, you are correct (this example comes courtesy of the "bash Cookbook" by O'Reilly).
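Here's a quick sketch of that pattern run against some made-up data (the names and numbers are invented, and I'm piping the lines in instead of using a real datafile).

```shell
# Hypothetical sample records, invented for this example.
records=$(printf '%s\n' \
    'Alice 123-45-6789' \
    'Bob 123456789' \
    'Carol 12-345-6789' \
    'Dave 555-1234' \
    'Order 1234567890')

# The same pattern as above: 3 digits, optional dash, 2 digits, optional dash, 4 digits.
ssn_pattern='[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}'

ssn_hits=$(printf '%s\n' "$records" | grep "$ssn_pattern")
printf '%s\n' "$ssn_hits"
```

The Alice and Bob lines match, with and without dashes, while Carol and Dave do not. The "Order" line matches as well, because the pattern isn't anchored and a ten-digit number contains nine digits in a row. That's worth remembering before trusting a pattern like this on real data.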


There are many more options to regex. This just scratches the surface, but even with just this level of understanding, you can do a lot. A full-blown tutorial on regex goes well beyond the scope of these posts, but if you would like to have a general, all-purpose tutorial on regex, check out http://www.regular-expressions.info/tutorial.html


Also, there are a variety of regex "engines" available. Standard Linux tools like grep, sed, and awk use POSIX regular expressions, which come in "basic" and "extended" flavors. Programming languages use a variety of standards, many of them similar but with their own individual quirks. If you are using programming languages like Python, Ruby, PHP, Perl, etc. you will need to look at how regex is implemented in your language of choice.
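As a small taste of those quirks, grep alone understands two POSIX flavors. The sketch below (sample lines invented for illustration) compares the default "basic" syntax with the "extended" syntax you get from grep -E, where the braces don't need backslashes and '?' means "zero or one".

```shell
# Hypothetical sample data, invented for this example.
lines=$(printf '%s\n' 'id 123' 'id 45' 'ssn 123-45-6789' 'ssn 123456789')

# Basic regular expressions (grep's default): interval braces are escaped.
bre_hits=$(printf '%s\n' "$lines" | grep '[0-9]\{3\}')

# Extended regular expressions (grep -E): the same interval without backslashes.
ere_hits=$(printf '%s\n' "$lines" | grep -E '[0-9]{3}')

# The Social Security Number pattern in both flavors; in ERE, -? is an optional dash.
ssn_bre=$(printf '%s\n' "$lines" | grep '[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}')
ssn_ere=$(printf '%s\n' "$lines" | grep -E '[0-9]{3}-?[0-9]{2}-?[0-9]{4}')

printf '%s\n' "$ere_hits"
printf '%s\n' "$ssn_ere"
```

Each pair of commands selects the same lines. Which flavor to use is mostly a matter of readability, but it's a good reminder to check which syntax a given tool expects before assuming a pattern will carry over.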


Bottom Line:


regex is a core idea in a variety of scripting and programming languages, and a way to make shell scripts much more dynamic. Regular expressions take time to understand, and like any other skill, repetition and practice go a long way towards a better understanding of how to use them.
