06 — Regular Expressions
A declarative notation for describing regular languages — no machines needed
Why Regular Expressions?

DFAs and NFAs are machines — they describe HOW to process a string. Regular expressions are patterns — they describe WHAT the language looks like. Same languages, different perspective.

You already know regex from programming (/^[a-z]+@[a-z]+\\.com$/), but formal regex in automata theory is simpler — no backreferences, no lookaheads, no character classes. Just three operations: union, concatenation, and Kleene star. That's enough to describe exactly the regular languages. Nothing more, nothing less.

Syntax: What's a Regular Expression?
Definition (recursive). \(R\) is a regular expression over alphabet \(\Sigma\) if \(R\) is:
  1. \(a\) for some \(a \in \Sigma\) — matches the single symbol \(a\)
  2. \(\varepsilon\) — matches the empty string
  3. \(\emptyset\) — matches nothing (empty language)
  4. \(R_1 \cup R_2\) — union (either \(R_1\) or \(R_2\))
  5. \(R_1 \circ R_2\) — concatenation (\(R_1\) followed by \(R_2\))
  6. \(R_1^*\) — Kleene star (zero or more repetitions of \(R_1\))

Cases 1-3 are base cases, 4-6 are inductive cases. Every regex is built from atoms via these three operations.

Semantics: What Language Does It Describe?

The language of \(R\):
Operator Precedence
Precedence (highest to lowest):
  1. \(*\) — Kleene star (binds tightest)
  2. \(\circ\) — concatenation (often written without the dot: \(ab\) means \(a \circ b\))
  3. \(\cup\) — union (binds loosest)
Example: \(ab^* \cup c\) means \((a(b^*)) \cup c\), not \(a(b^* \cup c)\) or \((ab)^* \cup c\).

This is just like arithmetic: \(*\) is like exponentiation, concatenation is like multiplication, and \(\cup\) is like addition. You'd read \(2 \cdot 3^2 + 4\) as \((2 \cdot (3^2)) + 4\).

Key Identities
IdentityWhy
\(\emptyset^* = \varepsilon\)Zero repetitions of nothing = empty string
\(\varepsilon^* = \varepsilon\)Any number of empty strings = empty string
\(R \cup \emptyset = R\)\(\emptyset\) is identity for union (like 0 for addition)
\(R \circ \varepsilon = \varepsilon \circ R = R\)\(\varepsilon\) is identity for concatenation
\(R \circ \emptyset = \emptyset \circ R = \emptyset\)Concatenation with empty language = empty
\(R \cup R = R\)Union is idempotent
\((R^*)^* = R^*\)Starring star doesn't add anything
\(R^+ = R \circ R^*\)One or more = one, then zero or more
\(\emptyset\) vs \(\varepsilon\) — don't confuse them!
The Equivalence Triangle
Regular Expression
↓ ↑
NFA DFA

All three describe exactly the regular languages.

Conversion paths:

Regex → NFA: Thompson's Construction

Build the NFA recursively, mirroring the regex structure. Each sub-regex becomes a small NFA; they're composed using \(\varepsilon\)-transitions.

Base Cases

Symbol \(a\): Two states, one transition.

digraph { rankdir=LR; bgcolor="transparent"; node [shape=circle fontname="Helvetica" fontsize=11 width=0.4 fixedsize=true]; edge [fontname="Helvetica" fontsize=10]; __start [shape=none label="" width=0]; __start -> q0; q0 [label="q₀" style=filled fillcolor="#d0e8ff"]; q1 [label="q₁" shape=doublecircle style=filled fillcolor="#c8f7c5"]; q0 -> q1 [label="a"]; }

Empty string \(\varepsilon\): Single accept state — accepts only \(\varepsilon\).

digraph { rankdir=LR; bgcolor="transparent"; node [shape=circle fontname="Helvetica" fontsize=11 width=0.4 fixedsize=true]; __start [shape=none label="" width=0]; __start -> q0; q0 [label="q₀" shape=doublecircle style=filled fillcolor="#c8f7c5"]; }

Empty set \(\emptyset\): Single non-accept state — rejects everything.

digraph { rankdir=LR; bgcolor="transparent"; node [shape=circle fontname="Helvetica" fontsize=11 width=0.4 fixedsize=true]; __start [shape=none label="" width=0]; __start -> q0; q0 [label="q₀" style=filled fillcolor="#d0e8ff"]; }

Inductive Cases

Union \(R_1 \cup R_2\): New start with \(\varepsilon\)-transitions to both sub-NFAs, both feed into a new accept state.

digraph { rankdir=LR; bgcolor="transparent"; node [shape=circle fontname="Helvetica" fontsize=11 width=0.4 fixedsize=true]; edge [fontname="Helvetica" fontsize=10]; __start [shape=none label="" width=0]; __start -> q0; q0 [label="q₀" style=filled fillcolor="#d0e8ff"]; subgraph cluster_r1 { style=dashed; color="#9a9590"; label="NFA for R₁"; fontname="Helvetica"; fontsize=9; labeljust=l; s1 [label="s₁"]; f1 [label="f₁"]; s1 -> f1 [label="..." style=dashed color="#9a9590"]; } subgraph cluster_r2 { style=dashed; color="#9a9590"; label="NFA for R₂"; fontname="Helvetica"; fontsize=9; labeljust=l; s2 [label="s₂"]; f2 [label="f₂"]; s2 -> f2 [label="..." style=dashed color="#9a9590"]; } qf [label="q_f" shape=doublecircle style=filled fillcolor="#c8f7c5"]; q0 -> s1 [label="ε"]; q0 -> s2 [label="ε"]; f1 -> qf [label="ε"]; f2 -> qf [label="ε"]; }

Concatenation \(R_1 \circ R_2\): Chain the two sub-NFAs — accept state of \(R_1\) connects via \(\varepsilon\) to start of \(R_2\).

digraph { rankdir=LR; bgcolor="transparent"; node [shape=circle fontname="Helvetica" fontsize=11 width=0.4 fixedsize=true]; edge [fontname="Helvetica" fontsize=10]; __start [shape=none label="" width=0]; subgraph cluster_r1 { style=dashed; color="#9a9590"; label="NFA for R₁"; fontname="Helvetica"; fontsize=9; labeljust=l; s1 [label="s₁"]; f1 [label="f₁"]; s1 -> f1 [label="..." style=dashed color="#9a9590"]; } subgraph cluster_r2 { style=dashed; color="#9a9590"; label="NFA for R₂"; fontname="Helvetica"; fontsize=9; labeljust=l; s2 [label="s₂"]; f2 [label="f₂" shape=doublecircle style=filled fillcolor="#c8f7c5"]; s2 -> f2 [label="..." style=dashed color="#9a9590"]; } __start -> s1; f1 -> s2 [label="ε"]; }

Kleene star \(R_1^*\): New start state (also accepting — handles \(i=0\)), \(\varepsilon\) into sub-NFA, loop back from sub-NFA accept to sub-NFA start.

digraph { rankdir=LR; bgcolor="transparent"; node [shape=circle fontname="Helvetica" fontsize=11 width=0.4 fixedsize=true]; edge [fontname="Helvetica" fontsize=10]; __start [shape=none label="" width=0]; __start -> q0; q0 [label="q₀" shape=doublecircle style=filled fillcolor="#c8f7c5"]; subgraph cluster_r1 { style=dashed; color="#9a9590"; label="NFA for R₁"; fontname="Helvetica"; fontsize=9; labeljust=l; s1 [label="s₁"]; f1 [label="f₁"]; s1 -> f1 [label="..." style=dashed color="#9a9590"]; } q0 -> s1 [label="ε"]; f1 -> s1 [label="ε"]; f1 -> q0 [label="ε" constraint=false]; }

Every Thompson NFA has exactly one start state and one accept state, making composition clean.

Worked Examples

Example 1: \((ab \cup a)^*\)

What language does this describe?

Example 2: \(\Sigma^* 1 \Sigma^*\)

Where \(\Sigma = \{0, 1\}\), so \(\Sigma^* = (0 \cup 1)^*\).

Language: all strings that contain at least one 1. The pattern says: anything, then a 1, then anything. This is one of the most common regex patterns — "contains X" = \(\Sigma^* X \Sigma^*\).

Example 3: \((0 \cup 1)^* 0\)

Language: all binary strings ending in 0. The star part matches any prefix, then we require a 0 at the end.

Regex Exercises

For each regex, click to reveal the language it describes.

1. \(a^* b a^*\)
Strings over \(\{a,b\}\) with exactly one \(b\). Any number of a's, then one b, then any number of a's. Examples: \(b, ab, ba, aba, aaba\).
2. \((aa)^*(bb)^* \cup (bb)^*(aa)^*\)
Strings of even-length blocks of a's followed by even-length blocks of b's, OR even-length blocks of b's followed by even-length blocks of a's. Examples: \(\varepsilon, aa, bb, aabb, bbaa, aaaabb, bbaaaaaa\). Note: each block must have an even number of characters.
3. \((\varepsilon \cup 0)(1 \cup 10)^*\)
Strings that optionally start with 0, then have zero or more repetitions of "1" or "10". This generates all binary strings that don't contain "00" (consecutive zeros). Think about it: between any two zeros, there must be at least one 1.
4. \(\emptyset^* \cup a\)
\(\emptyset^* = \{\varepsilon\}\) (zero repetitions of nothing = empty string). So this is \(\{\varepsilon\} \cup \{a\} = \{\varepsilon, a\}\). Just two strings.
Formal Regex vs Programming Regex
Programming regex has features beyond formal regex: Formal regex = exactly regular languages. Programming regex (PCRE) = sometimes more, thanks to backreferences and lookaheads.
Summary
Regular expressions: