06 Regular Expressions

Why Regular Expressions?

DFAs and NFAs are machines — they describe HOW to process a string. Regular expressions are patterns — they describe WHAT the language looks like. Same languages, different perspective.

You already know regex from programming (/^[a-z]+@[a-z]+\\.com$/), but formal regex in automata theory is simpler — no backreferences, no lookaheads, no character classes. Just three operations: union, concatenation, and Kleene star. That's enough to describe exactly the regular languages. Nothing more, nothing less.

Syntax: What's a Regular Expression?

Definition (recursive). $R$ is a regular expression over alphabet $\Sigma$ if $R$ is:

$a$ for some $a \in \Sigma$ — matches the single symbol $a$
$\varepsilon$ — matches the empty string
$\emptyset$ — matches nothing (empty language)
$R_1 \cup R_2$ — union (either $R_1$ or $R_2$)
$R_1 \circ R_2$ — concatenation ($R_1$ followed by $R_2$)
$R_1^*$ — Kleene star (zero or more repetitions of $R_1$)

Cases 1-3 are base cases, 4-6 are inductive cases. Every regex is built from atoms via these three operations.

Semantics: What Language Does It Describe?

The language of $R$:

$L(a) = \{a\}$
$L(\varepsilon) = \{\varepsilon\}$
$L(\emptyset) = \emptyset$
$L(R_1 \cup R_2) = L(R_1) \cup L(R_2)$
$L(R_1 \circ R_2) = L(R_1) \circ L(R_2) = \{xy \mid x \in L(R_1), y \in L(R_2)\}$
$L(R_1^*) = L(R_1)^*$

Operator Precedence

Precedence (highest to lowest):

$*$ — Kleene star (binds tightest)
$\circ$ — concatenation (often written without the dot: $ab$ means $a \circ b$)
$\cup$ — union (binds loosest)

Example: $ab^* \cup c$ means $(a(b^*)) \cup c$, not $a(b^* \cup c)$ or $(ab)^* \cup c$.

This is just like arithmetic: $*$ is like exponentiation, concatenation is like multiplication, and $\cup$ is like addition. You'd read $2 \cdot 3^2 + 4$ as $(2 \cdot (3^2)) + 4$.

Key Identities

Identity	Why
$\emptyset^* = \varepsilon$	Zero repetitions of nothing = empty string
$\varepsilon^* = \varepsilon$	Any number of empty strings = empty string
$R \cup \emptyset = R$	$\emptyset$ is identity for union (like 0 for addition)
$R \circ \varepsilon = \varepsilon \circ R = R$	$\varepsilon$ is identity for concatenation
$R \circ \emptyset = \emptyset \circ R = \emptyset$	Concatenation with empty language = empty
$R \cup R = R$	Union is idempotent
$(R^)^ = R^*$	Starring star doesn't add anything
$R^+ = R \circ R^*$	One or more = one, then zero or more

$\emptyset$ vs $\varepsilon$ — don't confuse them!

$\emptyset$ = empty language (no strings at all). It's the "zero" of regex.
$\varepsilon$ = language containing only the empty string. It's the "one" of concatenation.
$R \circ \emptyset = \emptyset$ (multiply by zero = zero)
$R \circ \varepsilon = R$ (multiply by one = identity)

The Equivalence Triangle

Regular Expression
↓ ↑
NFA ↔ DFA

All three describe exactly the regular languages.

Conversion paths:

Regex → NFA: Thompson's construction (this page)
NFA → DFA: Subset construction (page 04)
DFA → Regex: State elimination via GNFA (page 07)
NFA → Regex: Via GNFA directly (page 07)

Regex → NFA: Thompson's Construction

Build the NFA recursively, mirroring the regex structure. Each sub-regex becomes a small NFA; they're composed using $\varepsilon$-transitions.

Base Cases

Symbol $a$: Two states, one transition.

Empty string $\varepsilon$: Single accept state — accepts only $\varepsilon$.

digraph { rankdir=LR; bgcolor="transparent"; node [shape=circle fontname="Helvetica" fontsize=11 width=0.4 fixedsize=true]; __start [shape=none label="" width=0]; __start -> q0; q0 [label="q₀" shape=doublecircle style=filled fillcolor="#c8f7c5"]; }

Empty set $\emptyset$: Single non-accept state — rejects everything.

Inductive Cases

Union $R_1 \cup R_2$: New start with $\varepsilon$-transitions to both sub-NFAs, both feed into a new accept state.

digraph { rankdir=LR; bgcolor="transparent"; node [shape=circle fontname="Helvetica" fontsize=11 width=0.4 fixedsize=true]; edge [fontname="Helvetica" fontsize=10]; __start [shape=none label="" width=0]; __start -> q0; q0 [label="q₀" style=filled fillcolor="#d0e8ff"]; subgraph cluster_r1 { style=dashed; color="#9a9590"; label="NFA for R₁"; fontname="Helvetica"; fontsize=9; labeljust=l; s1 [label="s₁"]; f1 [label="f₁"]; s1 -> f1 [label="..." style=dashed color="#9a9590"]; } subgraph cluster_r2 { style=dashed; color="#9a9590"; label="NFA for R₂"; fontname="Helvetica"; fontsize=9; labeljust=l; s2 [label="s₂"]; f2 [label="f₂"]; s2 -> f2 [label="..." style=dashed color="#9a9590"]; } qf [label="q_f" shape=doublecircle style=filled fillcolor="#c8f7c5"]; q0 -> s1 [label="ε"]; q0 -> s2 [label="ε"]; f1 -> qf [label="ε"]; f2 -> qf [label="ε"]; }

Concatenation $R_1 \circ R_2$: Chain the two sub-NFAs — accept state of $R_1$ connects via $\varepsilon$ to start of $R_2$.

digraph { rankdir=LR; bgcolor="transparent"; node [shape=circle fontname="Helvetica" fontsize=11 width=0.4 fixedsize=true]; edge [fontname="Helvetica" fontsize=10]; __start [shape=none label="" width=0]; subgraph cluster_r1 { style=dashed; color="#9a9590"; label="NFA for R₁"; fontname="Helvetica"; fontsize=9; labeljust=l; s1 [label="s₁"]; f1 [label="f₁"]; s1 -> f1 [label="..." style=dashed color="#9a9590"]; } subgraph cluster_r2 { style=dashed; color="#9a9590"; label="NFA for R₂"; fontname="Helvetica"; fontsize=9; labeljust=l; s2 [label="s₂"]; f2 [label="f₂" shape=doublecircle style=filled fillcolor="#c8f7c5"]; s2 -> f2 [label="..." style=dashed color="#9a9590"]; } __start -> s1; f1 -> s2 [label="ε"]; }

Kleene star $R_1^*$: New start state (also accepting — handles $i=0$), $\varepsilon$ into sub-NFA, loop back from sub-NFA accept to sub-NFA start.

Every Thompson NFA has exactly one start state and one accept state, making composition clean.

Worked Examples

Example 1: $(ab \cup a)^*$

What language does this describe?

Inner: $ab \cup a$ = strings that are either "ab" or "a"
Star: zero or more repetitions of those pieces
Members: $\varepsilon$, $a$, $ab$, $aa$, $aab$, $aba$, $abab$, $aaba$, ...
Non-members: $b$, $ba$, $bb$, $abb$, ...
In words: strings that can be broken into chunks, each being "a" or "ab". Equivalently: strings of a's and b's where every b is immediately preceded by an a.

Example 2: $\Sigma^* 1 \Sigma^*$

Where $\Sigma = \{0, 1\}$, so $\Sigma^* = (0 \cup 1)^*$.

Language: all strings that contain at least one 1. The pattern says: anything, then a 1, then anything. This is one of the most common regex patterns — "contains X" = $\Sigma^* X \Sigma^*$.

Example 3: $(0 \cup 1)^* 0$

Language: all binary strings ending in 0. The star part matches any prefix, then we require a 0 at the end.

Regex Exercises

For each regex, click to reveal the language it describes.

1. $a^* b a^*$

2. $(aa)^*(bb)^* \cup (bb)^*(aa)^*$

3. $(\varepsilon \cup 0)(1 \cup 10)^*$

4. $\emptyset^* \cup a$

Formal Regex vs Programming Regex

Programming regex has features beyond formal regex:

[a-z], \d, . — character classes (just sugar for union)
+, ?, {n,m} — quantifiers (expressible with $\circ$ and $*$)
^, $ — anchors (context, not part of the language theory)
\1, \b — backreferences — these go BEYOND regular languages! The language $\{ww \mid w \in \Sigma^*\}$ is not regular, but PCRE can match it.

Formal regex = exactly regular languages. Programming regex (PCRE) = sometimes more, thanks to backreferences and lookaheads.

Summary

Regular expressions:

Base cases: $a$, $\varepsilon$, $\emptyset$
Operations: $\cup$ (union), $\circ$ (concat), $*$ (star)
Precedence: $*$ > $\circ$ > $\cup$
Regex ↔ NFA ↔ DFA: all describe regular languages
Regex → NFA: Thompson's construction (recursive, $\varepsilon$-based)
NFA → Regex: GNFA state elimination (next page)

← Regular Operations Index GNFA & State Elimination →

Identity	Why
\(\emptyset^* = \varepsilon\)	Zero repetitions of nothing = empty string
\(\varepsilon^* = \varepsilon\)	Any number of empty strings = empty string
\(R \cup \emptyset = R\)	\(\emptyset\) is identity for union (like 0 for addition)
\(R \circ \varepsilon = \varepsilon \circ R = R\)	\(\varepsilon\) is identity for concatenation
\(R \circ \emptyset = \emptyset \circ R = \emptyset\)	Concatenation with empty language = empty
\(R \cup R = R\)	Union is idempotent
\((R^)^ = R^*\)	Starring star doesn't add anything
\(R^+ = R \circ R^*\)	One or more = one, then zero or more

Semantics: What Language Does It Describe?

Base Cases

Inductive Cases

Example 1: \((ab \cup a)^*\)

Example 2: \(\Sigma^* 1 \Sigma^*\)

Example 3: \((0 \cup 1)^* 0\)

Regex Exercises