# Computation – Regular Expressions

September 9, 2009 Leave a comment

## Computers and Sets don’t go

Up till now we’ve only seen how to express languages as sets, such as:

- {(01)n | n Є N}

While we can all understand this, if you go ahead right now and type it into a computer then you’ll just get a lot of evil error messages thrown at you – no, not becuase you’re using *Windows, *but because we have to express the language in a very different way. This way of expressing is known a patterns!

## Patterns are the way!

Let’s take the example set from above. Now, we have already seen some other way to express this, namely:

- {01}*

With this, the star tells the computer to match a given string to an arbitrary number of 01’s, this could be anything from 1 – 100 repeats, or even 0, which is the empty word, ε.

Let’s say for example, that this is the only word in this language, if this statement is true than we can get rid of the bracer brackets and replace them for nice curly ones so that we now have:

- (01)*

Now this a computer can understand! All it has to do is compare a given string and check it like this:

- Is the first character a 0?
- Is the next one a 1?
- Is the one after that a 0?
- And so on….

This, as you may have already guessed, is a pattern. A pattern consists of various characters of some given alphabet, concatenated. We are allowed to apply the Kleene star (*) to any part of a given string, and we use the curly brackets ‘()’ to indicate which part of a string should be affected by the star.

## More patterns

Now, let’s say for example you have the following language, expressed in the following set:

- {0n | n Є N} U {1n | n Є N} = {xn | x=0 or x=1}

This is actually a little trickier than you may think. For example, we cannot use 0*1*, or even worse (01)*. This is because both of these include words that contain both 0 and 1. Whereas, with the described language, we need a pattern that can include words made up of nothing but 0s, or 1s.

For this, a new symbol is introduced. This symbol is ‘|’. Now that we have this symbol, we can create a new pattern!:

- 0*|1*

The best way to describe the ‘|’, is to thing of it as ‘0’ OR ‘1’, or in this case an:

- arbitrary number of 0s OR arbitrary number of 1s

A more in-depth look would be this. Take the following pattern (0|1|2). Now, the words that are accepted by this pattern are:

- ‘0’, ‘1’ and ‘2’
- NOT 012, 021, 210, 120, 12, 10…. Etc

Another example could be the pattern (0|1*|2), which accepts:

- ‘0’
- ‘2’
- (1)* so an arbitrary number of 1s

## Regular Expressions

Up till now, we have only seen the idea and usefulness of patterns. Now it is time to find out how we form patterns, and when a word matches a given pattern. Before we start, I need to introduce a new symbol, this is the empty set, which is the pattern which is matched by no word at all, and is denoted by Ø. Now its time for our forth definition:

———————————————————————————————–

Definition 4: Let Σ be a given alphabet. A pattern or regular expression over Σ, is any word over the set ΣPAT:

ΣPAT = Σ U {Ø, ε, |, *, (, )}

This is generated by the following definition:

- Empty Pattern: The character Ø is a pattern.
- Empty Word: The character ε is a pattern.
- Letters: Every letter from Σ is a pattern.
- Concatenation: if p1 and p2 are patterns then so is (p1p2)
- Alternative: if p1 and p2 are patterns then so is (p1 |p2)
- Kleene Star: if p is a pattern then so is (p)*

———————————————————————————————–

As you may have seen, we have just created a new language, namely ΣPAT; this is the language of all regular expressions.

This new language is actually recursively defined. In other words, we have a base case which is the empty set, Ø,or the empty word, ε, as well as all the letters of the underlying alphabet Σ.

The language is recursive, as it will have to call itself every time as a new character is met, and is then applied to the pattern. If all is well, we move to the next character, and yet another call of the language.

## Matching Regular Expressions

Please note, that in the definition of a pattern (definition 4), there is no meaning attached to any of the operators, or even the symbols that appear in the base cases. We only actually find out the intended meaning of these when we define a word that matches a pattern.

———————————————————————————————–

Definition 5: Let p be some pattern over some alphabet Σ, and then let s be some word over Σ. We can say that smatches p, if one of the following cases holds:

- Empty Word: The empty word, ε, matches the word ε
- Base Case: the pattern p is a single character, and s is that character
- Concatenation: the pattern p is a concatenation, so p=(p1p2). And there are words s1 and s2 such that s1matches p1, s2 matches p2, and that s is the concatenation of s1 and s2.
- Alternative: the pattern p is an alternative, so p=(p1|p2). And s matches p1 or p2, or both.
- Kleene Star: the pattern p is of the form p=(q)*, and s can be written as a finite concatenations=s1s2…sn such that s1, s2,…,sn all match q.
- Note: no word matches the empty set Ø.

———————————————————————————————–

## Languages Described by Regular Expressions

If we are given a pattern, we can now define a language based on this pattern.

———————————————————————————————–

Definition 6: Let p be a regular expression over some alphabet Σ. The language that is defined by the pattern p, which is expressed as ζ(p), is the set of all words over Σ that match p. So in other words:

ζ(p) = {s Є Σ* | s matches p}

So s is a word over Σ.

———————————————————————————————–

## Regular Languages

We have therefore, now defined a certain category of languages, namely those that can be defined using regular expressions. It actually turns out that these languages are important! So we therefore give them a name: “regular”.

———————————————————————————————–

Definition 7: A language ζ is regular if it is the set of all words matching some regular expression, that is, if there is a pattern p such that ζ = ζ(p).

———————————————————————————————–