Regular Expressions

This article was was translated and adapted with permission by the original author. His original article can be found at Tutorial Reguläre Ausdrücke.

Tutorial Regular expressions

“Regular expressions” are a kind of language that can be used when programming for various problem-solving tasks, especially when it comes to examining strings or searching for something inside them.

And because the name “Regular Expressions” is a little bulky, they are often simply reduced to “RegEx(s)”.

Introduction

Here is a small tutorial on these esoterically appealing but incredibly powerful strings, which trigger a child-like wonder as one makes his first attempts at the keyboard.

What are regular expressions? As I have said, string expressions can be tested for a particular composition using regular expressions. This can be important for applications that expect user input.
Examples:

school grades
postal codes
E-mail addresses
Order numbers

Concrete application

Popular places where you can encounter regular expressions and apply this knowledge:

Web applications (e.g. in PHP, Perl)
Unix scripts

In the following, it is assumed that you are already familiar with how regular expressions are used in the environment of your choice. In PHP this is done, for example, with the functions preg_match (checking and finding strings) and preg_replace (replacing strings).

The best way to test your expression is with a “dry-run”. You can use an online tool like regex 101 or the regex coach. There are numerous other options, but those are two good examples.

Good style

Often you will find that there are different solutions to the same problem. In this case, one is confronted with the following options.

An exact vs. a more general solution. If there is a risk that the exact solution is too “restrictive” (and thus may frustrate a user if their correct input is not accepted), take the general one. If the problem is very “clear”, rather take the exact one.
An accurate vs. a fast solution. Long regular expressions can take a lot of time to process. Again, you have to weigh what is more important to you. It could annoy users if it takes a very long time to check an input.
A simple vs. an “elegant” solution. You could express in 10 characters what others could not with 60, great, but remember that you may want to change something later! That can be harder than you think. Therefore, it’s better to take the simpler variant or comment complicated expressions.

Problems

If your regular expression causes problems, but you’re pretty sure it’s right, start by checking the special characters and what you have to do with them to use them. Special characters are already “banal” things like a period.

In addition you should really test your expressions in the RegEx-Coach or the like! This sometimes provokes incredibly irregular behavior. ;-)

You can also use Google and Google Groups to find lots of good information and templates to various common problems.

Conventions

In this tutorial the following colors are used for texts (strings):

Regular expression that you can theoretically use
Marked new parts
Intentionally erroneous code

I often use the “string”, which is not a stranger to programmers, for some text. In addition, an expression “meets” = “describes a text exactly” = “does the desired function” = “matches”.

Simple expressions

One-element regular expressions

Let’s start small. We want to check whether an input corresponds to a school grade of 1-6.

[123456]

does that for us. You’ll see, in square brackets, a list of characters that are allowed. Overall, the entire bracketed expression is only for one character (i.e. 1 or 2 or … or 6).

Since our numbers from 1 to 6 follow each other so beautifully, we can also write

[1-6]

which makes things a bit clearer. In exactly the same way, you could check whether an entry corresponds to a track on a train station:

[1-9]

for a station with 9 tracks. Today, in our station, track 4 was blocked so it’s not allowed as input:

[1-35-9]

We divided the input area into two areas 1-3 and 5-9. As you can see, the two areas are simply written next to each other. This might feel a little strange at first (intuitively one would like to use a space), but we will encounter that quite often.

Multiple regular expressions

What if our station has grown some and is now extended to 12 tracks:

[1-12]

(Track 4 is open again ;-)). But beware, there’s a mistake!

As mentioned above, the entire expression in the brackets is only for one character. “12” contains two characters. The code above will not work: it means “1 to 1” or “2”. We will demonstrate how to deal with railway stations with more than 9 tracks later.

For the moment we want to check something different: Platforms 1-9 each have a section “a” and “b”. We shall therefore check for 1a, 1b, 2a, …, 9a, 9b:

[1-9][ab]

Let’s say a number 1-9 followed by a letter a or b.

Thus, several rectangular brackets correspond to several characters. If an expression describes several characters, they are simply placed one after the other. Then the input from left to right is compared with your expression.

Of course not all letters have to be listed:

[1-9][a-d]

This also now applies to 4d.

Please note that upper and lower case letters are handled separately. A meaningful extension would therefore be:

[1-9] [a-dA-D]

Options

Let’s look at entering a house number. These can consist of one or more (up to 3) digits and an a-z at the end, though this is not required.

[1-9][0-9]?[0-9]?[a-z]?

The question mark behind an element (and [1-9] is an element!) states: The preceding element can occur but does not have to.

The last code is: A digit (1-9), optionally two more digits (0-9), optionally a letter.

Consider such a construction with 10 digits. You can easily imagine how it could become very long. There is a different way of writing it if you want to allow an element more than once:

a{1,3}h

Matches “ah” just as well as “aaah”. The number in parentheses stands for the {minimum, maximum} number of characters. Therefore, we can also formulate our house numbers as follows:

[1-9][0-9]{0,2}[a-z]?

And then for longer streets (USA ;-)) use something like:

[1-9][0-9]{1,4}[a-z]?

Now it allows house numbers in the five-digit range, but requires at least two-digits (note the change before the comma).

A construction like

[0-9]{5}

(i.e., a square bracket with only one entry) requires an exact number of repetitions of the preceding expression. In our case 5 times: This regular expression would be suitable to check (German) postal codes.

We can also express an “at least”

[0-9]{3,}

matches at least 3 digits.

Any repetitions

So far we have always had to know how many characters were to be found. There are, however, cases in which a sign may be repeated as often as desired.

Let’s look at the phone number verification. If you write a regular expression, it is often a good idea to make a note of what it should do. Our telephone numbers should be formatted like the following:

0651/55541-36
0049 160 555678
0180.23.555.63

In addition to numbers, there may also be hyphens, slashes, blanks, and dots. A valid element would be

[0-9/. -]

(So this one character consists of a number between 0 and 9 OR a slash OR a point or a space or a minus.) Be careful! As you can see, there are two hyphens in the parentheses, one in a special function to indicate a range of numbers, and one as a “real” hyphen, which may appear. In order to reduce confusion, we have to mask the latter so one can see that it has no special function here. This can be done with a pre-defined backslash:

[0-9/. \-]

And then let’s say that these characters may occur as often as you like:

[0-9/. \-]+

The plus means: one or more of the preceding character, i.e. at least one time. If we would also allow an empty telephone number (the mathematician would say: the null number), there is a different character:

[0-9/. \-]*

also applies to “” (empty string) and of course our telephone numbers.

This expression is, of course, a very general expression (see discussion above). It also allows phone numbers, which are obviously not correct, like

--..//123

If you want to further restrict this, you have to develop a more complex expression.

So to summarize again: The + stands for a repetition that happens one or more times, while a * means zero or more number of times.

Placeholders

Let’s look at another example. We want to search for a writer in a library, but we do not know if he has a second first name. Therefore, the first name can be followed by any characters, and then by the surname. A possible solution is as follows:


Marius .*Pontmercy

which means: “Marius”, followed by a space followed by an arbitrary character (that’s what the point stands for!) as often as desired (and possibly not at all) and then the surname. This applies to “Marius Pontmercy” as well as to “Marius Miller Pontmercy”.

Just like that, you can also use the + sign after the point, to get at least one character.

The point “normally eats” almost everything, but not line breaks. You can find how to get them to do that under modifiers.

The “attentive reader” will have noticed the fact that we already had a point above within the square brackets. As with many, but not all, special characters, these must be masked outside of square brackets (backslash before), but not within.

Negate the characters

Suppose we do not know the author’s second first name, but we can remember that it does not contain a “q” or a “z”. Not a problem:

Marius [^qz]+ Pontmercy

requires a second first name (ergo the plus), and assigns an arbitrary character that is not q or z. The ^ negates a character class and applies exactly to the closing parenthesis.

Brackets

Brackets can be used to summarize longer expressions into an element, thereby allowing the above-learned to be applied to partial expressions:

Marius (Miller )?Pontmercy

applies to exactly “Marius Miller Pontmercy” or “Marius Pontmercy” and nothing else. The question mark refers (thanks to the brackets) to the complete second first name and the following space.

It is also possible to write:

Ba(na)*na

which also stands for this yellow thing as well as “Banananana”. Again, curly brace notation can be used:

Ba(na){2,5}na

which of course now has a different meaning.

Alternatives

You can also do other things with brackets, i.e. specify alternatives to a partial expression:

The weather is (great|really bad)

In this example, last words may be “great” or “really bad”, but not both.

Modifiers

In all RegEx variants, you can set so-called modifiers and thus control the exact behavior of the expression. You can do this in Java, for example, when constructing a matcher object. In PHP, a regular expression always has the syntax

[Limits][RegEx][Limits][Modifier(s)]

for example:

/(My|Expression)/im

The slashes are the limiting characters (others are conceivable here, for example ~), and “i” and “m” in this case are modifiers. Common modifiers include:

i Case-Insensitivity (case-insensitive)
s point is multi-lineable: the point also eats line breaks (this is not the case as a rule).
m line mode: The characters ^ and $ also match the beginning and end of the lines. Without the modifier, they only match the start and end of the entire string.

Modifiers always refer to the entire expression and are therefore an easily overlooked source of error.

More complex expressions

Nesting

Of course, various brackets may also be nested inside one another like in the following expression:

 (VW (Golf|Polo)|Fiat (Punto|Panda))

Which meets “VW Golf”, “VW Polo”, “Fiat Punto” and “Fiat Panda”. This allows very short expressions for long strings, but it can also take a lot of time to process.

These alternatives can also be repeated:

(10|01)+

describes a sequence of zeros and ones in which a maximum of 2 zeros or ones follow one another. (If this is not clear now, just think about what you can do with “10” and “01”.)

Greedy expressions

A practical example: The following, somewhat longer code looks for the links or their destination addresses of an HTML file. A link in an HTML source code has, as a rule, a format like this:

<a href="[...destination address...]" [...Additional information ...]>

for example:

<a href="http://www.example.org/" target="_blank">

A possible expression for this is found quite quickly:

<a href=".*".*>

Looks good, but it does not work. Why?

The author thought the expression would stop at the link to the link, but it does not. The following is a section from an HTML file with a selected hit:


<BODY> Bla Blubb 1 2 3 <a href="http://www.example.org/" target="_blank">
link text </a> Much, much more <i>Text</i> Blubb 42

This clearly goes too far! The second point is too “greedy” and eats all signs even over several closing angle brackets. In other situations it would be possible for the first point run amok.

In our case, there are two possible solutions. Often, however, you can only use one, so I’ll present both:

Replace the point with something that does not eat any more angle brackets:

<a href=".*"[^>]*>

This is a method that is often used (consider which characters mark the end and exclude them from the one that applies). The whole thing must now be done analogously for the first point:

<a href="[^"]*"[^>]*>

The point “re-educate” (or * making it frugal) so that both together only eat as much as is absolutely necessary. This is done by adding a question mark, which of course no longer has the known function:

<a href=".*?".*?>

Of course this special use of the question mark also works behind a plus sign.

Groups

Virtually all RegEx dialects allow for the formation of groups and their storage for later use. Brackets can also be used for this purpose.

Subsequent use may mean that an expression is supposed to hit a longer character string, but only a part of it should be used.

So, in conjunction with the above expression, it would be very useful to filter all the destination addresses from an HTML page.

How exactly you get to the contents of the groups, is in the guidance of the language you use. For PHP you’ll find it, for example, at preg_match and java, as far as I can remember, somewhere near regex.matcher. Google should know more.

<a href="(.*?)".*?>

The first point and its “multipliers” are bracketed. The first group now contains the URL.

Note that when the groups are numbered, the order of the parenthesis counts. This is important to note when using nested brackets. In addition, all brackets are usually included, even if they are only used to identify alternatives (see above).

Credentials

There is another, very useful purpose for groups. Imagine if you want to find all the numbers that start and end with the same number in a sequence of numbers (say they are separated by a space).

So we have

129 337873 78324 43938 9388 824998 349734

We can use such an expression:

([0-9])[0-9]*\1

Where the \1 points to the content of the first parenthesis and therefore must be the same as in this parenthesis. The first and last character of our expression are each a space to indicate the end and beginning of a number.

Let’s look at what it is:

129 337873 78324 43938 9388 824998 349734

As you can see, the first and last spaces are still added to the respective hit. This is not surprising, after all there is also a space in front of and behind our expression. This is not really good if this blank space really doesn’t interest us. Therefore, the RegEx language also provides a means, which also works if we do not want or can not use groups:

Special characters: word boundaries

There are characters in the regular expression, that are not in the text, which are matched afterwards. This does not tell you anything? Well, let’s look at an example of how to write the above expression without the spaces:

\b([0-9])[0-9]*\1\b

This “\b” is a related element and indicates a word beginning or a word end (i.e. a word boundary). Do you know what is often used in word processing programs and editors “Search for a whole word only” in the search function? If you choose this option, then in a search for “Kai”, instead of these hits

Kai goes after
Kaiserslautern

you’ll only find this match:

Kai goes after
Kaiserslautern

The latter can be translated or expressed in regular expressions:

\bKai\b

So “Kai” only if it is surrounded by word boundaries. What exactly is a word boundary? A word boundary occurs between a word character and a nonword character. Huh? (Explanation follows!)

Special characters: Other characters

There are - surprise! - even more special characters that you can use which can shorten a RegEx beautifully. These always consist of a backslash followed by another character (letters). Incidentally, a large letter always stands for the opposite of a small one.

\w \W

A word-sign (small w) stands exactly for [a-zA-Z0-9] and a non-word sign (large W) stands exactly for everything else, thus [^a-zA-Z0-9].

\d \D

A digit, that is a digit from 0-9. Corresponds to [0-9] and uppercase [^0-9].

\b \B

You’ve already been introduced to the “small” variant above. A large B stands, accordingly, for all places where no word boundary occurs.

\s \S

The small variant stands for all whitespaces: these are almost all signs that you can not see. So Return (or Enter), Empty (Space), Tab (ulator).

\\

Special characters, as you have seen, begin if you start with backslash. If you need to have a backslash expressly, you can do this simply by escaping with a second backslash. It is said that the first “masks” the second.

\. \+ \* \( \) \[ \] \- \$ \|

The point and many other characters, as you have seen above, have a special function. Therefore, they are blocked ( “masked”) by means of a backslash. This applies to the minus sign only within character classes, but you can omit the backslash in the same for many other characters.

Beginning and ending characters

Imagine you want to check a date, let’s say

1\.3\.2004

And you let it run on the following date

11.3.2004

What’s the result? Your expression will tell you that it hits. That’s also clear somehow because if you leave out the the one, what remains is your date. But you want the entire date to look like yours?

^1\.3\.2004$

This carrot at the beginning says: Only match if the string to be searched starts here. And the dollar sign stands for: Only if the string to be searched ends here.

Of course, one can also use them individually; As an example, let’s look at two regular expressions: the first is a string that ends with a number. The second is a string that begins with an opening parenthesis. In doing so, we use the above mentioned masking for the parenthesis.

\d$
^\(

Positive lookaheads and lookbehinds

You have now learned how to determine word boundaries without actually eating them yourself. There is also a universal possibility.

The following sequence is used as an example:

000566403580000345050052301078906040092800100001007680

From this sequence of numbers, you need to extract all sequences containing three digits not equal to zero and bounded by a zero on each side. In our case, therefore, “358”, “345”, “523”, “789”, “928” and “768”. You can use the following code:

(?<=0)[1-9]{3}(?=0)

Don’t panic! To explain, we’ll begin in the middle. The [1-9]{3} should be clear (3 digits not equal to zero). Before that you can see the brackets (?<=0). The characters ?<= indicate a special function for the parenthesis. You have to look at them together. This stands for “lookahead for a zero”. And because “lookahead” in the normal reading direction of the Western world means that the clip looks “backwards”, this function is called a “positive lookbehind”.

Just as with the characters “?=”, these assign the special function “lookbehind … must come” to the bracket, in our case, therefore, “lookbehind for a zero”. The whole is called “positive lookahead”, because the parenthesis looks “forward”.

So, if we use our knowledge about “\W”, we could also express our “\b” example from above:

(?<=\W)(\d)\d*\1(?=\W)

For this explanation, we’ll start again in the middle. There is the code from above (one digit, any number of additional digits and another digit which is the same as the first one). The very first parenthesis requires a non-word character before the first digit and the last parenthesis a non-word character after the last digit.

Obviously, lookaheads and lookbehinds can also occur individually or even multiple times in a regular expression. The only thing that does not work is a lookbehind with an unknown number of characters. (i.e., with things like *, + and {0,4})

All right? If not, then back to the beginning of this section and do not pass “Go” and do not sign up for regExes 200.

Negative lookaheads and lookbehinds

If you found that complicated and are slowly realizing how voodoo regular expressions are? It gets even better.

Here is another example: You want to check an e-mail address that is not from France. First, let’s look at a simple e-mail address check without the restriction:

[^@]+@.+\.[^.]+

(A character other than the @ sign, followed by an @ sign, any character (but at least one), a dot, and any character behind it, but no more. Matches awesome@example.com exactly as it does on many_.characters@even.more.points.example.com.) Important: the last pair of square brackets always eats the last part (the top-level domain, in the example “com”) of the email address.

Now we add a restriction and want to exclude “fr” at this point. The point may not follow “fr”.

[^@]+@.+\.(?!fr)[^.]+

does that for us. The ?! inside the parenthesis means, “not allowed to follow”.

This could also make a special search for a word.

\bF(?!eta\b).*\b

applies to all words that start with “F” but are not “Feta”. The following text highlights all strings that are taken:

Fett Feta False Feuchte Fuffziger

This was the “negative lookahead”.

And because it was so beautiful, I included a “negative lookbehind”:

.*(?<!Garbage)bucket

applies to all buckets in which no garbage belongs, ergo “meet at this point, if there is no “garbage” in front of it”.

I hope you find this useful as you begin to explore new and different ways to use RegEx.

PythonTrier

Upcoming Meets

Speakers needed

Group Resources

Conferences