A declarative notation for describing regular languages — no machines needed
Why Regular Expressions?
DFAs and NFAs are machines — they describe HOW to process a string. Regular expressions are patterns — they describe WHAT the language looks like. Same languages, different perspective.
You already know regex from programming (/^[a-z]+@[a-z]+\\.com$/), but formal regex in automata theory is simpler — no backreferences, no lookaheads, no character classes. Just three operations: union, concatenation, and Kleene star. That's enough to describe exactly the regular languages. Nothing more, nothing less.
Syntax: What's a Regular Expression?
Definition (recursive). \(R\) is a regular expression over alphabet \(\Sigma\) if \(R\) is:
\(a\) for some \(a \in \Sigma\) — matches the single symbol \(a\)
\(\varepsilon\) — matches the empty string
\(\emptyset\) — matches nothing (empty language)
\(R_1 \cup R_2\) — union (either \(R_1\) or \(R_2\))
\(R_1 \circ R_2\) — concatenation (\(R_1\) followed by \(R_2\))
\(R_1^*\) — Kleene star (zero or more repetitions of \(R_1\))
Cases 1-3 are base cases, 4-6 are inductive cases. Every regex is built from atoms via these three operations.
Semantics: What Language Does It Describe?
The language of \(R\):
\(L(a) = \{a\}\)
\(L(\varepsilon) = \{\varepsilon\}\)
\(L(\emptyset) = \emptyset\)
\(L(R_1 \cup R_2) = L(R_1) \cup L(R_2)\)
\(L(R_1 \circ R_2) = L(R_1) \circ L(R_2) = \{xy \mid x \in L(R_1), y \in L(R_2)\}\)
\(L(R_1^*) = L(R_1)^*\)
Operator Precedence
Precedence (highest to lowest):
\(*\) — Kleene star (binds tightest)
\(\circ\) — concatenation (often written without the dot: \(ab\) means \(a \circ b\))
\(\cup\) — union (binds loosest)
Example: \(ab^* \cup c\) means \((a(b^*)) \cup c\), not \(a(b^* \cup c)\) or \((ab)^* \cup c\).
This is just like arithmetic: \(*\) is like exponentiation, concatenation is like multiplication, and \(\cup\) is like addition. You'd read \(2 \cdot 3^2 + 4\) as \((2 \cdot (3^2)) + 4\).
Key Identities
Identity
Why
\(\emptyset^* = \varepsilon\)
Zero repetitions of nothing = empty string
\(\varepsilon^* = \varepsilon\)
Any number of empty strings = empty string
\(R \cup \emptyset = R\)
\(\emptyset\) is identity for union (like 0 for addition)
\(R \circ \varepsilon = \varepsilon \circ R = R\)
\(\varepsilon\) is identity for concatenation
\(R \circ \emptyset = \emptyset \circ R = \emptyset\)
Concatenation with empty language = empty
\(R \cup R = R\)
Union is idempotent
\((R^*)^* = R^*\)
Starring star doesn't add anything
\(R^+ = R \circ R^*\)
One or more = one, then zero or more
\(\emptyset\) vs \(\varepsilon\) — don't confuse them!
\(\emptyset\) = empty language (no strings at all). It's the "zero" of regex.
\(\varepsilon\) = language containing only the empty string. It's the "one" of concatenation.
\(R \circ \emptyset = \emptyset\) (multiply by zero = zero)
\(R \circ \varepsilon = R\) (multiply by one = identity)
The Equivalence Triangle
Regular Expression ↓ ↑ NFA↔DFA
All three describe exactly the regular languages.
Conversion paths:
Regex → NFA: Thompson's construction (this page)
NFA → DFA: Subset construction (page 04)
DFA → Regex: State elimination via GNFA (page 07)
NFA → Regex: Via GNFA directly (page 07)
Regex → NFA: Thompson's Construction
Build the NFA recursively, mirroring the regex structure. Each sub-regex becomes a small NFA; they're composed using \(\varepsilon\)-transitions.
Kleene star \(R_1^*\): New start state (also accepting — handles \(i=0\)), \(\varepsilon\) into sub-NFA, loop back from sub-NFA accept to sub-NFA start.
In words: strings that can be broken into chunks, each being "a" or "ab". Equivalently: strings of a's and b's where every b is immediately preceded by an a.
Example 2: \(\Sigma^* 1 \Sigma^*\)
Where \(\Sigma = \{0, 1\}\), so \(\Sigma^* = (0 \cup 1)^*\).
Language: all strings that contain at least one 1. The pattern says: anything, then a 1, then anything. This is one of the most common regex patterns — "contains X" = \(\Sigma^* X \Sigma^*\).
Example 3: \((0 \cup 1)^* 0\)
Language: all binary strings ending in 0. The star part matches any prefix, then we require a 0 at the end.
Regex Exercises
For each regex, click to reveal the language it describes.
1. \(a^* b a^*\)
Strings over \(\{a,b\}\) with exactly one \(b\). Any number of a's, then one b, then any number of a's. Examples: \(b, ab, ba, aba, aaba\).
2. \((aa)^*(bb)^* \cup (bb)^*(aa)^*\)
Strings of even-length blocks of a's followed by even-length blocks of b's, OR even-length blocks of b's followed by even-length blocks of a's. Examples: \(\varepsilon, aa, bb, aabb, bbaa, aaaabb, bbaaaaaa\). Note: each block must have an even number of characters.
3. \((\varepsilon \cup 0)(1 \cup 10)^*\)
Strings that optionally start with 0, then have zero or more repetitions of "1" or "10". This generates all binary strings that don't contain "00" (consecutive zeros). Think about it: between any two zeros, there must be at least one 1.
4. \(\emptyset^* \cup a\)
\(\emptyset^* = \{\varepsilon\}\) (zero repetitions of nothing = empty string). So this is \(\{\varepsilon\} \cup \{a\} = \{\varepsilon, a\}\). Just two strings.
Formal Regex vs Programming Regex
Programming regex has features beyond formal regex:
[a-z], \d, . — character classes (just sugar for union)
+, ?, {n,m} — quantifiers (expressible with \(\circ\) and \(*\))
^, $ — anchors (context, not part of the language theory)
\1, \b — backreferences — these go BEYOND regular languages! The language \(\{ww \mid w \in \Sigma^*\}\) is not regular, but PCRE can match it.
Formal regex = exactly regular languages. Programming regex (PCRE) = sometimes more, thanks to backreferences and lookaheads.