Beyond regular languages — grammars that can count and nest
Why CFGs Exist
On page 08, we proved that \(\{0^n1^n \mid n \geq 0\}\) is not regular. No DFA, NFA, or regex can recognize it. The core issue: finite automata have finite memory, so they can't count unboundedly. The language requires matching the exact count of 0s and 1s, and a DFA simply can't do that.
But this language isn't exotic or pathological — it's the pattern behind every matching structure you encounter in real software:
Balanced parentheses — ((())) is valid, (() is not. A linter needs to count nesting depth.
Programming language syntax — if { if { ... } } has nested blocks. The parser must match each opening brace with its closing brace.
HTML tags — <div><p>...</p></div> requires matching open/close tags at arbitrary depth.
Natural language grammar — "The boy [that the girl [that the teacher taught] saw] ran" — nested clauses with matching subjects.
We need a formalism more powerful than regular languages. Context-free grammars are exactly that. They sit one level above regular languages in the Chomsky hierarchy — they can describe everything a regex can, plus recursive and nested structure.
The key idea: Instead of a machine that reads input and changes state (DFA/NFA), a CFG works in the opposite direction — it generates strings by repeatedly rewriting variables according to rules. The set of all strings a grammar can generate is its language. If a string has nested structure, the grammar captures that nesting through recursive rules.
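To make this generative view concrete, here is a minimal Python sketch (the dict representation and function names are choices for this sketch, not part of the formal definition): rules map each variable to a list of alternatives, and strings are produced by repeatedly rewriting variables until only terminals remain.

```python
import random

# Rules for S -> 0S1 | epsilon. Each alternative is a tuple of symbols;
# the empty tuple stands for epsilon. Uppercase letters play the variables.
RULES = {
    "S": [("0", "S", "1"), ()],
}

def generate(symbol="S", depth=10):
    """Expand `symbol` by randomly chosen rules; terminals pass through."""
    if symbol not in RULES:          # terminal: keep as-is
        return symbol
    if depth == 0:                   # bound recursion: take the shortest alternative
        alt = min(RULES[symbol], key=len)
    else:
        alt = random.choice(RULES[symbol])
    return "".join(generate(s, depth - 1) for s in alt)

# Every generated string lies in {0^n 1^n}: equal counts, all 0s before all 1s.
for _ in range(5):
    w = generate()
    assert w.count("0") == w.count("1")
    assert w == "0" * w.count("0") + "1" * w.count("1")
```

Running the loop a few times samples different strings from the language; the recursive rule is what produces the matched nesting.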
Formal Definition
Definition (Sipser 2.2). A context-free grammar (CFG) is a 4-tuple \(G = (V, \Sigma, R, S)\) where:
\(V\) — a finite set of variables (also called nonterminals)
\(\Sigma\) — a finite set of terminals, disjoint from \(V\) (i.e., \(V \cap \Sigma = \emptyset\))
\(R\) — a finite set of rules (also called productions), where each rule is of the form
\[A \to w \quad \text{with } A \in V \text{ and } w \in (V \cup \Sigma)^*\]
\(S \in V\) — the start variable
Let's unpack this compared to what you know:
Variables \(V\) vs. Terminals \(\Sigma\)
Variables are placeholders that get rewritten — they never appear in the final output string. Terminals are the actual symbols of the language's alphabet (the \(\Sigma\) from page 01). Think of it like template expansion: variables are the template tags, terminals are the literal text that stays.
Common pitfalls (from the TU Delft lectures):
The start variable \(S\) is in \(V\) — it's a variable, not separate from them.
\(\varepsilon\) is not in \(\Sigma\). It can appear on the right-hand side of a rule (meaning "this variable can vanish"), but it's not a terminal symbol.
Each rule has a single variable on the left-hand side. That's what makes it "context-free" — the variable can be replaced regardless of what's around it (its context).
Rules \(R\)
Each rule says: "variable \(A\) can be replaced by string \(w\)." The right-hand side \(w\) can be any combination of variables and terminals — including \(\varepsilon\) (meaning the variable can disappear). The shorthand \(A \to w_1 \mid w_2 \mid w_3\) means three separate rules: \(A \to w_1\), \(A \to w_2\), and \(A \to w_3\).
Counting rules carefully. The notation \(S \to X \mid Y\) is two rules, not one. Similarly, \(X \to aXa \mid bYb \mid \varepsilon\) is three rules. Count each alternative after a \(\mid\) as a separate rule.
Example from lectures: Given \(V = \{S, X, Y\}\), \(\Sigma = \{a, b\}\), and:
\[S \to X \mid Y \qquad X \to aXa \mid bYb \mid \varepsilon \qquad Y \to aXb \mid bYa \mid \varepsilon\]
That's \(2 + 3 + 3 = 8\) rules total.
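The rule count above is easy to check mechanically if each alternative is stored separately, as in this small sketch:

```python
# The lecture grammar, with each "|"-alternative stored as its own entry,
# so the total rule count is just the number of alternatives.
rules = {
    "S": ["X", "Y"],
    "X": ["aXa", "bYb", ""],   # "" stands for epsilon
    "Y": ["aXb", "bYa", ""],
}
n_rules = sum(len(alts) for alts in rules.values())
assert n_rules == 8   # 2 + 3 + 3
```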
What makes a rule "context-free"?
The left-hand side is always a single variable. Never a terminal, never two symbols, never a variable with context around it. Compare:
\(A \to aBc\) — valid CF rule (single variable on the left)
\(aA \to aBc\) — not context-free (terminal a on the left — this is a context-sensitive rule)
\(AB \to CD\) — not context-free (two variables on the left)
The name "context-free" means: the rule \(A \to w\) applies regardless of what surrounds \(A\). Whether \(A\) appears as \(\ldots xAy \ldots\) or \(\ldots sAt \ldots\), you can always replace it with \(w\).
How Derivations Work
A derivation starts from the start variable and repeatedly applies rules until only terminals remain. Each step replaces one variable with the right-hand side of one of its rules.
Notation:
\(u \Rightarrow v\) means "\(u\) yields \(v\) in one step" — exactly one variable in \(u\) was replaced using a rule
\(u \overset{*}{\Rightarrow} v\) means "\(u\) derives \(v\)" — zero or more steps from \(u\) to \(v\)
If \(u, v, w\) are strings of variables and terminals, and \(A \to x\) is a rule, then:
\[uAv \Rightarrow uxv\]
The variable \(A\) is replaced by \(x\), and the surrounding context \(u, v\) stays unchanged.
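The one-step yields relation is a direct string operation, transcribed here as a small Python sketch (the function name is ours):

```python
def step(sentential, i, rhs):
    """Apply a rule at position i: replace the single variable there by rhs.
    This is exactly uAv => uxv; the context u, v stays unchanged."""
    return sentential[:i] + rhs + sentential[i + 1:]

# One step of S => 0S1 applied to the S at index 1 of "0S1":
assert step("0S1", 1, "0S1") == "00S11"
```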
Worked derivation: \(\{0^n1^n \mid n \geq 0\}\)
Grammar \(G_1 = (\{S\}, \{0, 1\}, R, S)\) with rules \(S \to 0S1 \mid \varepsilon\). A sample derivation: \(S \Rightarrow 0S1 \Rightarrow 00S11 \Rightarrow 000S111 \Rightarrow 000111\).
Each application of \(S \to 0S1\) wraps the current string with a matching 0-1 pair. The \(S \to \varepsilon\) rule terminates the recursion. This is why CFGs can "count" — the recursive nesting naturally produces matched pairs.
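The derivation pattern can be replayed mechanically, since \(S\) is the only variable:

```python
def derive(n):
    """Apply S -> 0S1 exactly n times, then terminate with S -> epsilon."""
    s = "S"
    for _ in range(n):
        s = s.replace("S", "0S1")   # the only variable is S, so this is one step
    return s.replace("S", "")       # base case: S -> epsilon

assert derive(0) == ""
assert derive(3) == "000111"
```

Each loop iteration is one application of the recursive rule, adding one matched 0-1 pair around the variable.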
The language of a grammar. The language generated by \(G\) is:
\[L(G) = \{w \in \Sigma^* \mid S \overset{*}{\Rightarrow} w\}\]
That is, all strings of terminals that can be derived from the start variable. A language \(L\) is context-free if there exists a CFG \(G\) with \(L = L(G)\).
Context-Free Languages (CFLs)
Definition. A language \(L\) is context-free if there exists a CFG \(G\) such that \(L = L(G)\).
The class of context-free languages strictly contains the class of regular languages:
Given any DFA \(M = (Q, \Sigma, \delta, q_0, F)\), you can build an equivalent CFG:
Create a variable \(R_i\) for each state \(q_i\)
For each transition \(\delta(q_i, a) = q_j\), add the rule \(R_i \to aR_j\)
For each accept state \(q_i \in F\), add the rule \(R_i \to \varepsilon\)
The start variable is \(R_0\) (corresponding to the start state)
This grammar generates exactly \(L(M)\). So anything a DFA can recognize, a CFG can generate. The reverse is not true — \(\{0^n1^n\}\) is context-free but not regular.
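The DFA-to-CFG construction above is mechanical enough to write down directly. A sketch, using a hypothetical "even number of 1s" DFA as the input (the dict encoding of \(\delta\) is a representation choice):

```python
def dfa_to_cfg(delta, start, accept):
    """Build CFG rules from a DFA given as delta[(state, symbol)] = state.
    One variable per state; rules R_i -> a R_j and R_f -> epsilon."""
    rules = {}
    for (q, a), r in delta.items():
        rules.setdefault(q, []).append((a, r))   # R_q -> a R_r
    for f in accept:
        rules.setdefault(f, []).append(None)     # R_f -> epsilon
    return rules, start

# Example DFA: accepts strings over {0, 1} with an even number of 1s.
delta = {("e", "0"): "e", ("e", "1"): "o",
         ("o", "0"): "o", ("o", "1"): "e"}
rules, start = dfa_to_cfg(delta, "e", {"e"})
assert None in rules["e"] and None not in rules["o"]   # only "e" may vanish
assert ("1", "o") in rules["e"]                        # R_e -> 1 R_o
```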
The containment is strict. CFLs are more powerful than regular languages, but they're still limited. Some languages are not context-free at all — for example, \(\{a^n b^n c^n \mid n \geq 0\}\) (matching three counts simultaneously) requires even more power. There's a pumping lemma for CFLs too (covered later in the course).
Parse Trees
A derivation is a linear sequence of steps, but the structure it captures is actually a tree. A parse tree (also called a derivation tree) makes the hierarchical structure explicit.
Parse tree rules:
The root is labeled with the start variable
Each internal node is labeled with a variable
Each leaf is labeled with a terminal or \(\varepsilon\)
If a node is labeled \(A\) and its children (left to right) are \(X_1, X_2, \ldots, X_k\), then \(A \to X_1 X_2 \cdots X_k\) must be a rule in \(R\)
The yield of the tree = the leaves read left to right = the derived string
Parse tree for \(000111\) using grammar \(S \to 0S1 \mid \varepsilon\):
Reading the leaves left to right: \(0, 0, 0, \varepsilon, 1, 1, 1 = 000111\). The tree makes the nesting structure visible — each \(S\) node corresponds to one matched 0-1 pair.
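The tree and its yield can be modeled directly; in this sketch a node is a tuple of label and children, and a leaf is a one-element tuple (an encoding choice of ours):

```python
# The parse tree for 000111 under S -> 0S1 | epsilon.
tree = ("S", [("0",),
              ("S", [("0",),
                     ("S", [("0",), ("S", [("ε",)]), ("1",)]),
                     ("1",)]),
              ("1",)])

def tree_yield(node):
    """Read the leaves left to right; ε leaves contribute nothing."""
    if len(node) == 1:                        # leaf: terminal or ε
        return "" if node[0] == "ε" else node[0]
    return "".join(tree_yield(c) for c in node[1])

assert tree_yield(tree) == "000111"
```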
Here's a more complex example from the lectures. Grammar \(G = (\{S, X, Y\}, \{a, b\}, R, S)\) with:
\[S \to X \mid Y \qquad X \to aXa \mid bYb \mid \varepsilon \qquad Y \to aXb \mid bYa \mid \varepsilon\]
For the string \(abba\), the parse tree's yield reads \(a, b, \varepsilon, b, a = abba\). The corresponding derivation is \(S \Rightarrow X \Rightarrow aXa \Rightarrow abYba \Rightarrow abba\).
Leftmost and Rightmost Derivations
When a string being derived contains multiple variables, you have a choice: which variable to replace next? Two canonical orderings:
Leftmost derivation: At every step, replace the leftmost variable. Rightmost derivation: At every step, replace the rightmost variable.
Consider grammar \(G\) with rules: \(S \to aA \mid b\), \(A \to a \mid bB\), \(B \to aA \mid b\).
To derive the string \(ababababb\):
Leftmost derivation:
S ⇒ aA ⇒ abB ⇒ abaA ⇒ ababB ⇒ ababaA ⇒ abababB ⇒ abababaA ⇒ ababababB ⇒ ababababb
(In this grammar every sentential form contains exactly one variable, so the leftmost and rightmost derivations coincide step for step.)
Different derivation orders can produce the same parse tree. A leftmost and rightmost derivation of the same string may look different as step sequences, but if they correspond to the same parse tree, they represent the same structural interpretation of the string.
Key insight: The parse tree captures structure. The derivation order (leftmost vs. rightmost) is just a traversal order of that same tree. This distinction matters for defining ambiguity — see next section.
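A leftmost step is itself easy to mechanize: find the leftmost variable and replace it. This sketch (uppercase letters encode variables, a choice of ours) replays the derivation of ababababb rule by rule:

```python
RULES = {"S": ["aA", "b"], "A": ["a", "bB"], "B": ["aA", "b"]}

def leftmost_step(s, rhs):
    """Replace the leftmost variable (uppercase in this encoding) by rhs."""
    i = next(i for i, c in enumerate(s) if c.isupper())
    return s[:i] + rhs + s[i + 1:]

# Replay the leftmost derivation of ababababb, one chosen rule per step:
s = "S"
for rhs in ["aA", "bB", "aA", "bB", "aA", "bB", "aA", "bB", "b"]:
    assert rhs in RULES[next(c for c in s if c.isupper())]  # legal step
    s = leftmost_step(s, rhs)
assert s == "ababababb"
```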
Ambiguity
Sometimes a grammar can produce the same string through different parse trees. When this happens, the string has two different structural interpretations — a problem if your grammar is supposed to define the meaning of programs.
Definition (Sipser 2.7). A string \(w\) is derived ambiguously in grammar \(G\) if it has two or more different leftmost derivations (equivalently, two or more different parse trees). Grammar \(G\) is ambiguous if it generates some string ambiguously.
The classic example is the expression grammar \(\text{EXPR} \to \text{EXPR} + \text{EXPR} \mid \text{EXPR} \times \text{EXPR} \mid (\text{EXPR}) \mid a\): the string a+a×a has two different parse trees. Tree 1 evaluates as \((a + a) \times a\) — addition first. Tree 2 evaluates as \(a + (a \times a)\) — multiplication first. For a compiler, this ambiguity is a real bug: the program's meaning depends on which parse tree you pick.
Fixing ambiguity with grammar design
The unambiguous grammar \(G_4\) generates the same language but enforces standard precedence (\(\times\) before \(+\)): \(\text{EXPR} \to \text{EXPR} + \text{TERM} \mid \text{TERM}\), \(\text{TERM} \to \text{TERM} \times \text{FACTOR} \mid \text{FACTOR}\), \(\text{FACTOR} \to (\text{EXPR}) \mid a\).
Now a+a×a has exactly one parse tree, and it groups \(\times\) tighter than \(+\). The trick: introduce a variable hierarchy (\(\text{EXPR} \to \text{TERM} \to \text{FACTOR}\)) where operators at each level can only combine operands from the level below.
Parse tree for a+a×a using the unambiguous grammar \(G_4\):
Notice: \(\times\) sits deeper in the tree (under TERM), so it binds tighter. \(+\) sits at the top (EXPR level). Only one tree is possible — the hierarchy forces the grouping \(a + (a \times a)\).
The general structure — how the variable hierarchy enforces precedence:
EXPR
handles + — combines TERMs
EXPR → EXPR + TERM | TERM
↓ falls through to
TERM
handles × — combines FACTORs
TERM → TERM × FACTOR | FACTOR
↓ falls through to
FACTOR
terminals or ( ) to reset
FACTOR → ( EXPR ) | a
↓ reaches terminal a or ( ) jumps back to top ↑
Each level can only combine things from the level below. \(+\) combines TERMs, \(\times\) combines FACTORs. Parentheses ( ) are the escape hatch — they wrap an EXPR back into a FACTOR, resetting the precedence. This is the standard pattern behind expression grammars in programming language parsers.
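The hierarchy translates almost line for line into a recursive-descent parser: one function per variable. A sketch, with \(\times\) written as `*` and the left-recursive rules handled by the standard trick of looping instead of recursing on the left:

```python
def parse(tokens):
    """Recursive-descent parser for the EXPR/TERM/FACTOR grammar.
    Returns a nested tuple that makes the grouping explicit."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(t):
        nonlocal pos
        assert peek() == t, f"expected {t!r}"
        pos += 1

    def expr():                       # EXPR -> EXPR + TERM | TERM
        node = term()
        while peek() == "+":
            eat("+"); node = ("+", node, term())
        return node

    def term():                       # TERM -> TERM * FACTOR | FACTOR
        node = factor()
        while peek() == "*":
            eat("*"); node = ("*", node, factor())
        return node

    def factor():                     # FACTOR -> ( EXPR ) | a
        if peek() == "(":
            eat("("); node = expr(); eat(")"); return node
        eat("a"); return "a"

    node = expr()
    assert pos == len(tokens), "trailing input"
    return node

# a+a*a groups as a + (a*a): * binds tighter, exactly as the hierarchy forces.
assert parse(list("a+a*a")) == ("+", "a", ("*", "a", "a"))
# Parentheses reset the precedence:
assert parse(list("(a+a)*a")) == ("*", ("+", "a", "a"), "a")
```

Note how `expr` can only combine results of `term`, and `term` only results of `factor`; that is the variable hierarchy at work.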
Inherent ambiguity. Some context-free languages have the property that every grammar for them is ambiguous — no rewriting of the rules can fix it. Such languages are called inherently ambiguous. Example: \(\{a^i b^j c^k \mid i = j \text{ or } j = k\}\). Proving inherent ambiguity is hard — it's an advanced topic.
Chomsky Normal Form (CNF)
CFG rules can have any mix of variables and terminals on the right-hand side, making them hard to work with algorithmically. Chomsky Normal Form restricts rules to a very simple shape — and every CFG can be converted to it.
Definition (Sipser 2.8). A CFG is in Chomsky normal form if every rule is of the form:
\[A \to BC \qquad \text{or} \qquad A \to a\]
where \(A, B, C\) are variables (with \(B, C\) not the start variable), and \(a\) is a terminal. Additionally, the rule \(S \to \varepsilon\) is permitted if \(S\) is the start variable.
In other words: every rule either produces exactly two variables or exactly one terminal. Nothing else. No mixed rules like \(A \to aBc\), no long rules like \(A \to BCDE\), no \(\varepsilon\)-rules (except for the start variable).
Why CNF matters
Parsing algorithms: The CYK (Cocke-Younger-Kasami) algorithm requires CNF and runs in \(O(n^3)\) time — it's a practical algorithm for deciding whether a string is in a CFL.
Proofs: Many proofs about CFLs are easier when you can assume every rule has a simple, uniform shape.
Derivation length is predictable: For a string of length \(n \geq 1\), a CNF grammar uses exactly \(2n - 1\) rule applications (each \(A \to BC\) increases the string length by 1, each \(A \to a\) converts one variable to a terminal).
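The CYK algorithm itself fits in a few lines. A sketch, run on a hand-built CNF grammar for \(\{0^n1^n \mid n \geq 1\}\) (the grammar and encoding are ours, chosen for this example):

```python
def cyk(rules, start, w):
    """CYK membership test for a grammar in CNF.
    rules: list of (head, body) pairs; body is a 1-tuple (one terminal)
    or a 2-tuple (two variables)."""
    n = len(w)
    if n == 0:
        return (start, ()) in rules   # only S -> epsilon covers the empty string
    # table[i][l] = set of variables that derive the substring w[i : i + l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i in range(n):                               # length-1 substrings
        for head, body in rules:
            if body == (w[i],):
                table[i][1].add(head)
    for length in range(2, n + 1):                   # longer substrings
        for i in range(n - length + 1):
            for split in range(1, length):
                for head, body in rules:
                    if (len(body) == 2 and body[0] in table[i][split]
                            and body[1] in table[i + split][length - split]):
                        table[i][length].add(head)
    return start in table[0][n]

# CNF grammar for {0^n 1^n | n >= 1}:
# S0 -> AB | AC,  S -> AB | AC,  C -> SB,  A -> 0,  B -> 1
rules = [("S0", ("A", "B")), ("S0", ("A", "C")),
         ("S", ("A", "B")), ("S", ("A", "C")),
         ("C", ("S", "B")), ("A", ("0",)), ("B", ("1",))]
assert cyk(rules, "S0", "000111")
assert not cyk(rules, "S0", "0101")
```

The three nested loops over length, start position, and split point are where the \(O(n^3)\) bound comes from.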
Conversion to CNF (sketch)
Theorem (Sipser 2.9). Any context-free language is generated by a context-free grammar in Chomsky normal form.
The conversion proceeds in four stages:
Add a new start variable. Create \(S_0 \to S\). This ensures the start variable doesn't appear on the right-hand side of any rule.
Remove \(\varepsilon\)-rules. For each rule \(A \to \varepsilon\) (where \(A\) is not the start variable), find every rule that uses \(A\) on the right-hand side, and add versions with \(A\) deleted. E.g., if \(R \to uAv\) exists, add \(R \to uv\). Repeat until no \(\varepsilon\)-rules remain (except possibly \(S_0 \to \varepsilon\)).
Remove unit rules. A unit rule is \(A \to B\) (one variable to one variable). If \(A \to B\) and \(B \to u\), add \(A \to u\) (unless it's a unit rule we already removed). Repeat until no unit rules remain.
Convert remaining rules to proper form. For rules with right-hand sides of length \(\geq 3\), like \(A \to u_1 u_2 \cdots u_k\), introduce new variables to chain them into pairs: \(A \to u_1 A_1\), \(A_1 \to u_2 A_2\), ..., \(A_{k-2} \to u_{k-1} u_k\). For any terminal \(a\) that appears in a rule alongside variables, replace it with a fresh variable \(U_a\) and add \(U_a \to a\).
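Stage 4's chaining step can be sketched directly. This helper assumes every \(u_i\) is already a variable (terminals would first be replaced by \(U_a\) with \(U_a \to a\), as described above); the names it mints are illustrative:

```python
def binarize(head, body, fresh):
    """Split A -> u1 u2 ... uk (k >= 3) into a chain of two-symbol rules;
    `fresh()` mints a new variable name for each link in the chain."""
    out = []
    while len(body) > 2:
        a = fresh()                        # new variable A_i
        out.append((head, (body[0], a)))   # head -> u_j A_i
        head, body = a, body[1:]
    out.append((head, tuple(body)))        # final pair u_{k-1} u_k
    return out

names = iter("A%d" % i for i in range(1, 100))
chain = binarize("A", ["u1", "u2", "u3", "u4"], lambda: next(names))
assert chain == [("A", ("u1", "A1")), ("A1", ("u2", "A2")), ("A2", ("u3", "u4"))]
```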
Worked example (Sipser 2.10). Convert this grammar to CNF:
\(S \to ASA \mid aB\), \(A \to B \mid S\), \(B \to b \mid \varepsilon\)
Step 1 — New start: add \(S_0 \to S\).
Step 2 — Remove \(B \to \varepsilon\): add versions of rules using \(B\) with \(B\) deleted. \(S \to aB\) gives \(S \to a\). \(A \to B\) gives \(A \to \varepsilon\). Now remove \(A \to \varepsilon\): \(S \to ASA\) gives \(S \to SA \mid AS \mid S\). Continue until all \(\varepsilon\)-rules are eliminated.
Step 3 — Remove unit rules: \(S_0 \to S\), \(A \to B\), \(A \to S\), \(S \to S\). Replace each with the non-unit rules they lead to.
Step 4 — Fix long rules and mixed rules. Final result:
\(S_0 \to AA_1 \mid UB \mid a \mid SA \mid AS\)
\(S \to AA_1 \mid UB \mid a \mid SA \mid AS\)
\(A \to b \mid AA_1 \mid UB \mid a \mid SA \mid AS\)
\(A_1 \to SA\), \(U \to a\), \(B \to b\)
Worked Examples
Example 1: \(\{0^n1^n \mid n \geq 0\}\)
Grammar: \(S \to 0S1 \mid \varepsilon\)
This is the canonical example of a CFL that is not regular. The recursive rule \(S \to 0S1\) wraps matched pairs — it's the grammatical equivalent of push/pop on a stack. Each application adds one 0 to the left and one 1 to the right, maintaining the count invariant. The \(\varepsilon\)-rule is the base case.
Example 2: Arithmetic expressions with precedence
Grammar: \(\text{EXPR} \to \text{EXPR} + \text{TERM} \mid \text{TERM}\), \(\text{TERM} \to \text{TERM} \times \text{FACTOR} \mid \text{FACTOR}\), \(\text{FACTOR} \to (\text{EXPR}) \mid a\)
This grammar captures operator precedence: \(\times\) binds tighter than \(+\) because \(\times\) is handled at the \(\text{TERM}\) level (closer to the leaves) while \(+\) is at the \(\text{EXPR}\) level (closer to the root).
Derivation of a+a×a:
EXPR ⇒ EXPR + TERM ⇒ TERM + TERM ⇒ FACTOR + TERM
⇒ a + TERM ⇒ a + TERM × FACTOR ⇒ a + FACTOR × FACTOR
⇒ a + a × FACTOR ⇒ a + a × a
This derivation produces a single parse tree that groups as \(a + (a \times a)\) — correct precedence.
Interactive: Step Through a Derivation
Grammar for \(\{0^n1^n \mid n \geq 0\}\): \(S \to 0S1 \mid \varepsilon\)
Watch the derivation build string \(0^n1^n\) step by step.
Interactive: CFG Membership Checker
Test whether a string belongs to \(L(G)\) for the balanced parentheses grammar: \(S \to (S) \mid SS \mid \varepsilon\)
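Membership for this particular grammar has a well-known shortcut: a string of parentheses is balanced exactly when a running depth counter never goes negative and ends at zero. A sketch:

```python
def balanced(w):
    """Membership test for L(S -> (S) | SS | epsilon), via the standard
    depth-counter characterization of balanced parentheses."""
    depth = 0
    for c in w:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth < 0:        # a ')' with no matching '(' before it
                return False
        else:
            return False         # symbol outside the alphabet
    return depth == 0

assert balanced("((()))") and balanced("()()") and balanced("")
assert not balanced("(()") and not balanced(")(")
```

The counter is a one-cell stack: this hints at the pushdown automata that recognize CFLs.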
Exercises
Exercise 1: Design a CFG
Give a context-free grammar for the language \(L = \{a^i b^j \mid i \neq j,\ i, j \geq 0\}\) over \(\Sigma = \{a, b\}\).
Hint: Split into two cases — more \(a\)'s than \(b\)'s, or more \(b\)'s than \(a\)'s. Use the union technique: design separate grammars and combine with \(S \to S_1 \mid S_2\).
Key idea: \(L = \{a^i b^j \mid i > j\} \cup \{a^i b^j \mid i < j\}\).
For more \(a\)'s: generate at least one extra \(a\), then matched \(a^n b^n\).
For more \(b\)'s: matched \(a^n b^n\), then at least one extra \(b\).
\(S \to S_1 \mid S_2\)
\(S_1 \to aS_1 \mid aT\) (at least one extra \(a\), then matched pairs)
\(S_2 \to S_2 b \mid Tb\) (matched pairs, then at least one extra \(b\))
\(T \to aTb \mid \varepsilon\) (generates \(a^n b^n\))
Check: \(aab = a \cdot ab\): \(S \Rightarrow S_1 \Rightarrow aT \Rightarrow aaTb \Rightarrow aab\). Correct!
Check: \(abb = ab \cdot b\): \(S \Rightarrow S_2 \Rightarrow Tb \Rightarrow aTbb \Rightarrow abb\). Correct!
Check: \(\varepsilon\) or \(ab\): Neither \(S_1\) nor \(S_2\) can generate these. Correct — they have \(i = j\).
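The spot checks above can be extended to an exhaustive comparison up to a length bound. In this brute-force sketch, \(S_1\) and \(S_2\) are renamed P and Q so every symbol is a single character (an encoding choice of ours):

```python
RULES = {"S": ["P", "Q"], "P": ["aP", "aT"], "Q": ["Qb", "Tb"], "T": ["aTb", ""]}

def language(max_len):
    """All terminal strings of length <= max_len derivable from S."""
    done, frontier = set(), {"S"}
    while frontier:
        nxt = set()
        for s in frontier:
            i = next((k for k, c in enumerate(s) if c.isupper()), None)
            if i is None:                       # no variables left: a result
                done.add(s)
            else:
                for rhs in RULES[s[i]]:
                    t = s[:i] + rhs + s[i + 1:]
                    # prune: terminals already placed can never be removed
                    if sum(c.islower() for c in t) <= max_len:
                        nxt.add(t)
        frontier = nxt
    return {w for w in done if len(w) <= max_len}

expected = {"a" * i + "b" * j
            for i in range(7) for j in range(7) if i != j and i + j <= 6}
assert language(6) == expected   # matches {a^i b^j | i != j} up to length 6
```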
Exercise 2: Identify the Language
What language does this grammar generate?
\(S \to aSa \mid bSb \mid a \mid b \mid \varepsilon\)
Answer: The language of all palindromes over \(\{a, b\}\).
The rules \(S \to aSa\) and \(S \to bSb\) wrap matching symbols on both sides — building the palindrome from the outside in. The rules \(S \to a\), \(S \to b\), and \(S \to \varepsilon\) handle the center:
- \(S \to \varepsilon\): even-length palindromes (e.g., \(abba\))
- \(S \to a\) or \(S \to b\): odd-length palindromes (e.g., \(aba\), \(ababa\))
Example: \(abba\): \(S \Rightarrow aSa \Rightarrow abSba \Rightarrow abba\)
Example: \(aba\): \(S \Rightarrow aSa \Rightarrow aba\)
Note: \(\{w \in \{a,b\}^* \mid w = w^R\}\) is context-free but NOT regular. No DFA can check palindromes — it would need to remember the entire first half of the string to compare with the second half. That requires unbounded memory.
Exercise 3: Parse Tree from Rules
Given grammar \(G = (\{S, A, B\}, \{a, b\}, P, S)\) with rules:
\(S \to aA \mid b\), \(A \to a \mid bB\), \(B \to aA \mid b\)
Give a leftmost derivation for the string \(abababb\), and state how many rule applications are used.
Leftmost derivation:
\(S \Rightarrow aA \Rightarrow abB \Rightarrow abaA \Rightarrow ababB \Rightarrow ababaA \Rightarrow abababB \Rightarrow abababb\)
That's 7 rule applications:
1. \(S \to aA\)
2. \(A \to bB\)
3. \(B \to aA\)
4. \(A \to bB\)
5. \(B \to aA\)
6. \(A \to bB\)
7. \(B \to b\)
Notice the alternating pattern: \(S/A/B\) form a chain that generates alternating \(a\)'s and \(b\)'s. The grammar generates exactly the strings \((ab)^n aa\) and \((ab)^n abb\) for \(n \geq 0\), plus the single string \(b\). In other words, a prefix of repeating \(ab\) followed by either \(aa\) or \(abb\), or the lone string \(b\).
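A brute-force sketch can confirm what this grammar generates: enumerate all derivable strings up to a length bound and match them against the shape \(b\), or \((ab)^n\) followed by \(aa\) or \(abb\):

```python
import re

RULES = {"S": ["aA", "b"], "A": ["a", "bB"], "B": ["aA", "b"]}

def strings(max_len):
    """All terminal strings of length <= max_len derivable from S."""
    done, frontier = set(), {"S"}
    while frontier:
        nxt = set()
        for s in frontier:
            i = next((k for k, c in enumerate(s) if c.isupper()), None)
            if i is None:
                if len(s) <= max_len:
                    done.add(s)
            else:
                for rhs in RULES[s[i]]:
                    t = s[:i] + rhs + s[i + 1:]
                    if len(t) - 1 <= max_len:   # at most one variable per form
                        nxt.add(t)
        frontier = nxt
    return done

pattern = re.compile(r"b|(ab)*(aa|abb)")
found = strings(9)
assert all(pattern.fullmatch(w) for w in found)
assert {"b", "aa", "abb", "abaa", "ababb", "ababababb"} <= found
```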
Summary
Rules: always of the form \(A \to w\) where \(A\) is a single variable — that's the "context-free" part
Derivations: start from \(S\), apply rules until only terminals remain; \(\Rightarrow\) = one step, \(\overset{*}{\Rightarrow}\) = zero or more
Parse trees: tree representation of a derivation; leaves left-to-right = derived string
Ambiguity: a grammar is ambiguous if some string has two different parse trees (= two different leftmost derivations)
Chomsky Normal Form: every rule is \(A \to BC\) or \(A \to a\); every CFG can be converted to CNF
CFLs contain regular languages: every DFA can be converted to an equivalent CFG, but CFGs can also handle \(\{0^n1^n\}\) and other non-regular patterns
Power source: recursive rules like \(S \to 0S1\) — CFGs can "count" by nesting matched pairs