09 — Context-Free Grammars (CFGs)
Beyond regular languages — grammars that can count and nest
Why CFGs Exist

On page 08, we proved that \(\{0^n1^n \mid n \geq 0\}\) is not regular. No DFA, NFA, or regex can recognize it. The core issue: finite automata have finite memory, so they can't count unboundedly. The language requires matching the exact count of 0s and 1s, and a DFA simply can't do that.

But this language isn't exotic or pathological — it's the pattern behind every matching structure you encounter in real software: nested parentheses in expressions, matched opening and closing tags in HTML/XML, balanced braces around code blocks, and properly nested function calls.

We need a formalism more powerful than regular languages. Context-free grammars are exactly that. They sit one level above regular languages in the Chomsky hierarchy — they can describe everything a regex can, plus recursive and nested structure.

The key idea: Instead of a machine that reads input and changes state (DFA/NFA), a CFG works in the opposite direction — it generates strings by repeatedly rewriting variables according to rules. The set of all strings a grammar can generate is its language. If a string has nested structure, the grammar captures that nesting through recursive rules.
Formal Definition
Definition (Sipser 2.2). A context-free grammar (CFG) is a 4-tuple \(G = (V, \Sigma, R, S)\) where:
  1. \(V\) — a finite set of variables (also called nonterminals)
  2. \(\Sigma\) — a finite set of terminals, disjoint from \(V\) (i.e., \(V \cap \Sigma = \emptyset\))
  3. \(R\) — a finite set of rules (also called productions), where each rule is of the form \[A \to w \quad \text{with } A \in V \text{ and } w \in (V \cup \Sigma)^*\]
  4. \(S \in V\) — the start variable

Let's unpack this compared to what you know:

Variables \(V\) vs. Terminals \(\Sigma\)

Variables are placeholders that get rewritten — they never appear in the final output string. Terminals are the actual symbols of the language's alphabet (the \(\Sigma\) from page 01). Think of it like template expansion: variables are the template tags, terminals are the literal text that stays.
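The 4-tuple maps directly onto a data structure. A minimal sketch in Python (the dict layout is my own convention, not from the source): rules map each variable to a list of right-hand sides, with the empty tuple standing for \(\varepsilon\).

```python
# The grammar G1 = ({S}, {0,1}, R, S) with R: S -> 0S1 | ε, as plain data.
cfg = {
    "variables": {"S"},
    "terminals": {"0", "1"},
    "rules": {"S": [("0", "S", "1"), ()]},   # () stands for ε
    "start": "S",
}

# Sanity checks straight from the definition:
# V and Σ are disjoint, every LHS is in V, every RHS symbol is in V ∪ Σ.
assert cfg["variables"].isdisjoint(cfg["terminals"])
assert all(A in cfg["variables"] for A in cfg["rules"])
assert all(sym in cfg["variables"] | cfg["terminals"]
           for rhss in cfg["rules"].values() for rhs in rhss for sym in rhs)
```

The three asserts mechanize the disjointness and membership conditions of the definition; any grammar you write down should pass them.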

Common pitfalls (from the TU Delft lectures): mixing up variables and terminals (remember \(V \cap \Sigma = \emptyset\) — a symbol is one or the other, never both), and miscounting rules written with the \(\mid\) shorthand, since each alternative is a separate rule.

Rules \(R\)

Each rule says: "variable \(A\) can be replaced by string \(w\)." The right-hand side \(w\) can be any combination of variables and terminals — including \(\varepsilon\) (meaning the variable can disappear). The shorthand \(A \to w_1 \mid w_2 \mid w_3\) means three separate rules: \(A \to w_1\), \(A \to w_2\), and \(A \to w_3\).

Counting rules carefully. The notation \(S \to X \mid Y\) is two rules, not one. Similarly, \(X \to aXa \mid bYb \mid \varepsilon\) is three rules. Count each alternative after a \(\mid\) as a separate rule.

Example from lectures: Given \(V = \{S, X, Y\}\), \(\Sigma = \{a, b\}\), and: \[S \to X \mid Y \qquad X \to aXa \mid bYb \mid \varepsilon \qquad Y \to aXb \mid bYa \mid \varepsilon\] That's \(2 + 3 + 3 = 8\) rules total.
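The rule count can be checked mechanically once the grammar is written as data (a small sketch; the string encoding of right-hand sides is my own shorthand, with `""` for \(\varepsilon\)):

```python
# S -> X | Y, X -> aXa | bYb | ε, Y -> aXb | bYa | ε
rules = {
    "S": ["X", "Y"],
    "X": ["aXa", "bYb", ""],   # "" stands for ε
    "Y": ["aXb", "bYa", ""],
}

# Each alternative after a | is one rule, so the total is the
# number of right-hand sides across all variables.
total = sum(len(alts) for alts in rules.values())
print(total)  # → 8
```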

What makes a rule "context-free"?

The left-hand side is always a single variable. Never a terminal, never two symbols, never a variable with context around it. Compare:

The name "context-free" means: the rule \(A \to w\) applies regardless of what surrounds \(A\). Whether \(A\) appears as \(\ldots xAy \ldots\) or \(\ldots zAw \ldots\), you can always replace it with \(w\).

How Derivations Work

A derivation starts from the start variable and repeatedly applies rules until only terminals remain. Each step replaces one variable with the right-hand side of one of its rules.

Notation: If \(u, v, w\) are strings of variables and terminals, and \(A \to x\) is a rule, then: \[uAv \Rightarrow uxv\] The variable \(A\) is replaced by \(x\), and the surrounding context \(u, v\) stays unchanged.
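The single rewriting step \(uAv \Rightarrow uxv\) is easy to sketch in code. This minimal helper (my own, not from the source) replaces the leftmost occurrence of a variable; the notation itself permits replacing any occurrence.

```python
def derive_step(sentential, var, rhs):
    """Apply the rule var -> rhs to the leftmost occurrence of var: uAv => uxv.
    The surrounding context u, v is left untouched."""
    i = sentential.index(var)
    return sentential[:i] + rhs + sentential[i + 1:]

s = "S"
s = derive_step(s, "S", "0S1")   # "0S1"
s = derive_step(s, "S", "0S1")   # "00S11"
s = derive_step(s, "S", "")      # ε-rule: "0011"
print(s)  # → 0011
```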

Worked derivation: \(\{0^n1^n \mid n \geq 0\}\)

Grammar \(G_1 = (\{S\}, \{0, 1\}, R, S)\) with rules:

\[S \to 0S1 \mid \varepsilon\]

Let's derive the string \(000111\):

S
  ⇒ 0S1    (using \(S \to 0S1\))
  ⇒ 00S11    (using \(S \to 0S1\))
  ⇒ 000S111    (using \(S \to 0S1\))
  ⇒ 000111    (using \(S \to \varepsilon\))

Each application of \(S \to 0S1\) wraps the current string with a matching 0-1 pair. The \(S \to \varepsilon\) rule terminates the recursion. This is why CFGs can "count" — the recursive nesting naturally produces matched pairs.
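The derivation process above can be run exhaustively: expand the leftmost variable in every possible way and collect the sentential forms that contain only terminals. A sketch (my own code; the terminal-count prune works for this grammar because terminals are never erased, but grammars with rules like \(S \to SS\) would need stronger pruning to guarantee termination):

```python
from collections import deque

def generate(rules, start, max_len):
    """Enumerate all terminal strings of length <= max_len derivable from start.
    rules maps each variable (single character) to its right-hand sides."""
    language, seen = set(), {start}
    q = deque([start])
    while q:
        s = q.popleft()
        idx = next((i for i, c in enumerate(s) if c in rules), None)
        if idx is None:                       # only terminals remain: s ∈ L(G)
            language.add(s)
            continue
        for rhs in rules[s[idx]]:             # expand the leftmost variable
            t = s[:idx] + rhs + s[idx + 1:]
            terminals = sum(1 for c in t if c not in rules)
            if terminals <= max_len and t not in seen:
                seen.add(t)
                q.append(t)
    return language

print(sorted(generate({"S": ["0S1", ""]}, "S", 6), key=len))
# → ['', '01', '0011', '000111']
```

The output is exactly \(\{0^n1^n \mid 0 \leq n \leq 3\}\) — the language of \(G_1\) truncated at length 6.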

The language of a grammar. The language generated by \(G\) is: \[L(G) = \{w \in \Sigma^* \mid S \overset{*}{\Rightarrow} w\}\] That is, all strings of terminals that can be derived from the start variable. A language \(L\) is context-free if there exists a CFG \(G\) with \(L = L(G)\).
Context-Free Languages (CFLs)
Definition. A language \(L\) is context-free if there exists a CFG \(G\) such that \(L = L(G)\).

The class of context-free languages strictly contains the class of regular languages:

\[\text{Regular languages} \subset \text{Context-free languages}\]
  - Context-free languages: \(\{0^n1^n\}\), balanced parens, programming syntax, ...
  - Regular languages (strictly inside): DFA/NFA/regex — \(0^*1^*\), even parity, ...

Why every regular language is context-free

Given any DFA \(M = (Q, \Sigma, \delta, q_0, F)\), you can build an equivalent CFG: for each state \(q_i \in Q\), introduce a variable \(R_i\); for each transition \(\delta(q_i, a) = q_j\), add the rule \(R_i \to a R_j\); for each accept state \(q_i \in F\), add \(R_i \to \varepsilon\); the start variable is \(R_0\), corresponding to the start state \(q_0\).

This grammar generates exactly \(L(M)\). So anything a DFA can recognize, a CFG can generate. The reverse is not true — \(\{0^n1^n\}\) is context-free but not regular.
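The state-to-variable construction is a few lines of code. A sketch (function and state names are my own), applied to a small DFA for "even number of 1s":

```python
def dfa_to_cfg(delta, start, accept):
    """One variable R_q per DFA state q: each transition delta(q, a) = r
    becomes a rule R_q -> a R_r, and each accept state gets R_q -> ε."""
    rules = {}
    for (q, a), r in delta.items():
        rules.setdefault(q, []).append((a, r))   # R_q -> a R_r
    for q in accept:
        rules.setdefault(q, []).append(())       # R_q -> ε
    return rules, start

# DFA over {0,1} accepting strings with an even number of 1s:
# states E (even, accepting) and O (odd).
delta = {("E", "0"): "E", ("E", "1"): "O",
         ("O", "0"): "O", ("O", "1"): "E"}
rules, start = dfa_to_cfg(delta, "E", {"E"})
print(rules)
# Reads as: R_E -> 0 R_E | 1 R_O | ε    and    R_O -> 0 R_O | 1 R_E
```

Every rule this produces has the restricted shape \(R_i \to a R_j\) or \(R_i \to \varepsilon\) (a right-linear grammar), which is precisely why the construction can't escape the regular languages.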

The containment is strict. CFLs are more powerful than regular languages, but they're still limited. Some languages are not context-free at all — for example, \(\{a^n b^n c^n \mid n \geq 0\}\) (matching three counts simultaneously) requires even more power. There's a pumping lemma for CFLs too (covered later in the course).
Parse Trees

A derivation is a linear sequence of steps, but the structure it captures is actually a tree. A parse tree (also called a derivation tree) makes the hierarchical structure explicit.

Parse tree rules: the root is the start variable; each internal node is a variable whose children, read left to right, spell out the right-hand side of the rule applied to it; each leaf is a terminal or \(\varepsilon\); and reading the leaves left to right gives the derived string (the yield of the tree).

Parse tree for \(000111\) using grammar \(S \to 0S1 \mid \varepsilon\):

S
├── 0
├── S
│   ├── 0
│   ├── S
│   │   ├── 0
│   │   ├── S
│   │   │   └── ε
│   │   └── 1
│   └── 1
└── 1

Reading the leaves left to right: \(0, 0, 0, \varepsilon, 1, 1, 1 = 000111\). The tree makes the nesting structure visible — each \(S\) node corresponds to one matched 0-1 pair.
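Reading off the yield can be done programmatically. A sketch (my own representation: each node is a `(label, children)` pair, with an `"ε"` leaf contributing nothing):

```python
def tree_yield(node):
    """Concatenate the leaves of a parse tree left to right; ε leaves vanish."""
    label, children = node
    if not children:
        return "" if label == "ε" else label
    return "".join(tree_yield(c) for c in children)

# Parse tree for 000111 under S -> 0S1 | ε.
t = ("S", [("0", []),
           ("S", [("0", []),
                  ("S", [("0", []),
                         ("S", [("ε", [])]),
                         ("1", [])]),
                  ("1", [])]),
           ("1", [])])
print(tree_yield(t))  # → 000111
```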

Here's a more complex example from the lectures. Grammar \(G = (\{S, X, Y\}, \{a, b\}, R, S)\) with:

\[S \to X \mid Y \qquad X \to aXa \mid bYb \mid \varepsilon \qquad Y \to aXb \mid bYa \mid \varepsilon\]

Parse tree for \(w = abba\):

S
└── X
    ├── a
    ├── X
    │   ├── b
    │   ├── Y
    │   │   └── ε
    │   └── b
    └── a

Yield: \(a, b, \varepsilon, b, a = abba\). The derivation was \(S \Rightarrow X \Rightarrow aXa \Rightarrow abYba \Rightarrow abba\).

Leftmost and Rightmost Derivations

When a string being derived contains multiple variables, you have a choice: which variable to replace next? Two canonical orderings:

Leftmost derivation: At every step, replace the leftmost variable.
Rightmost derivation: At every step, replace the rightmost variable.

Consider grammar \(G\) with rules: \(S \to aA \mid b\), \(A \to a \mid bB\), \(B \to aA \mid b\).

To derive the string \(ababababb\):

Leftmost derivation:

S ⇒ aA ⇒ abB ⇒ abaA ⇒ ababB ⇒ ababaA ⇒ abababB ⇒ abababaA ⇒ ababababB ⇒ ababababb

(In this particular grammar every sentential form contains at most one variable, so the rightmost derivation is step-for-step identical. The distinction only shows up when a form contains two or more variables.)

Different derivation orders can produce the same parse tree. A leftmost and rightmost derivation of the same string may look different as step sequences, but if they correspond to the same parse tree, they represent the same structural interpretation of the string.

Key insight: The parse tree captures structure. The derivation order (left-to-right vs. right-to-left) is just a traversal order of that same tree. This distinction matters for defining ambiguity — see next section.
Ambiguity

Sometimes a grammar can produce the same string through different parse trees. When this happens, the string has two different structural interpretations — a problem if your grammar is supposed to define the meaning of programs.

Definition (Sipser 2.7). A string \(w\) is derived ambiguously in grammar \(G\) if it has two or more different leftmost derivations (equivalently, two or more different parse trees). Grammar \(G\) is ambiguous if it generates some string ambiguously.

The classic example: arithmetic expressions

Consider the "flat" grammar \(G_5\):

\[\langle\text{EXPR}\rangle \to \langle\text{EXPR}\rangle + \langle\text{EXPR}\rangle \mid \langle\text{EXPR}\rangle \times \langle\text{EXPR}\rangle \mid (\langle\text{EXPR}\rangle) \mid \texttt{a}\]

The string a+a×a has two different parse trees:

Tree 1: (a+a)×a

EXPR
├── EXPR
│   ├── EXPR
│   │   └── a
│   ├── +
│   └── EXPR
│       └── a
├── ×
└── EXPR
    └── a

Tree 2: a+(a×a)

EXPR
├── EXPR
│   └── a
├── +
└── EXPR
    ├── EXPR
    │   └── a
    ├── ×
    └── EXPR
        └── a

Tree 1 evaluates as \((a + a) \times a\) — addition first. Tree 2 evaluates as \(a + (a \times a)\) — multiplication first. For a compiler, this ambiguity is a real bug: the program's meaning depends on which parse tree you pick.
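The "real bug" is easy to make concrete: evaluating the two trees gives different numbers. A sketch (my own tuple encoding of the trees, with \(a = 2\)):

```python
def eval_tree(node):
    """Evaluate an expression tree: ('+', l, r), ('*', l, r), or a number."""
    if isinstance(node, int):
        return node
    op, l, r = node
    return eval_tree(l) + eval_tree(r) if op == "+" else eval_tree(l) * eval_tree(r)

a = 2
tree1 = ("*", ("+", a, a), a)   # Tree 1: (a+a)×a
tree2 = ("+", a, ("*", a, a))   # Tree 2: a+(a×a)
print(eval_tree(tree1), eval_tree(tree2))  # → 8 6
```

Same string, two trees, two answers — which is why an ambiguous grammar cannot serve as a definition of program meaning.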

Fixing ambiguity with grammar design

The unambiguous grammar \(G_4\) generates the same language but enforces standard precedence (\(\times\) before \(+\)):

\[\langle\text{EXPR}\rangle \to \langle\text{EXPR}\rangle + \langle\text{TERM}\rangle \mid \langle\text{TERM}\rangle\] \[\langle\text{TERM}\rangle \to \langle\text{TERM}\rangle \times \langle\text{FACTOR}\rangle \mid \langle\text{FACTOR}\rangle\] \[\langle\text{FACTOR}\rangle \to (\langle\text{EXPR}\rangle) \mid \texttt{a}\]

Now a+a×a has exactly one parse tree, and it groups \(\times\) tighter than \(+\). The trick: introduce a variable hierarchy (\(\text{EXPR} \to \text{TERM} \to \text{FACTOR}\)) where operators at each level can only combine operands from the level below.

Parse tree for a+a×a using the unambiguous grammar \(G_4\):

EXPR
├── EXPR
│   └── TERM
│       └── FACTOR
│           └── a
├── +
└── TERM
    ├── TERM
    │   └── FACTOR
    │       └── a
    ├── ×
    └── FACTOR
        └── a

Notice: \(\times\) sits deeper in the tree (under TERM), so it binds tighter. \(+\) sits at the top (EXPR level). Only one tree is possible — the hierarchy forces the grouping \(a + (a \times a)\).

The general structure — how the variable hierarchy enforces precedence:

EXPR
handles + — combines TERMs
EXPR → EXPR + TERM | TERM
↓ falls through to
TERM
handles × — combines FACTORs
TERM → TERM × FACTOR | FACTOR
↓ falls through to
FACTOR
terminals or ( ) to reset
FACTOR → ( EXPR ) | a
↓ reaches terminal a     or     ( ) jumps back to top ↑

Each level can only combine things from the level below. \(+\) combines TERMs, \(\times\) combines FACTORs. Parentheses ( ) are the escape hatch — they wrap an EXPR back into a FACTOR, resetting the precedence. This is how every programming language parser works.
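The EXPR/TERM/FACTOR hierarchy translates directly into a recursive-descent parser: one function per variable, each calling the level below. A sketch (my own code; note that naive recursive descent loops forever on left-recursive rules like \(\text{EXPR} \to \text{EXPR} + \text{TERM}\), so each level is rewritten as "one operand, then zero or more operator-operand pairs", and `*` stands in for ×):

```python
def parse(s):
    """Parse a string over {a, +, *, (, )} into a tree mirroring G4."""
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def eat(c):
        nonlocal pos
        assert peek() == c, f"expected {c!r} at position {pos}"
        pos += 1

    def expr():                      # EXPR -> TERM ('+' TERM)*
        t = term()
        while peek() == "+":
            eat("+"); t = ("+", t, term())
        return t

    def term():                      # TERM -> FACTOR ('*' FACTOR)*
        f = factor()
        while peek() == "*":
            eat("*"); f = ("*", f, factor())
        return f

    def factor():                    # FACTOR -> '(' EXPR ')' | 'a'
        if peek() == "(":
            eat("("); t = expr(); eat(")")
            return t
        eat("a")
        return "a"

    t = expr()
    assert pos == len(s), "trailing input"
    return t

print(parse("a+a*a"))   # → ('+', 'a', ('*', 'a', 'a'))
```

The output tree groups the multiplication under the addition — precedence falls out of the call structure, with no precedence table needed.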

Inherent ambiguity. Some context-free languages have the property that every grammar for them is ambiguous — no rewriting of the rules can fix it. Such languages are called inherently ambiguous. Example: \(\{a^i b^j c^k \mid i = j \text{ or } j = k\}\). Proving inherent ambiguity is hard — it's an advanced topic.
Chomsky Normal Form (CNF)

CFG rules can have any mix of variables and terminals on the right-hand side, making them hard to work with algorithmically. Chomsky Normal Form restricts rules to a very simple shape — and every CFG can be converted to it.

Definition (Sipser 2.8). A CFG is in Chomsky normal form if every rule is of the form: \[A \to BC \qquad \text{or} \qquad A \to a\] where \(A, B, C\) are variables (with \(B, C\) not the start variable), and \(a\) is a terminal. Additionally, the rule \(S \to \varepsilon\) is permitted if \(S\) is the start variable.

In other words: every rule either produces exactly two variables or exactly one terminal. Nothing else. No mixed rules like \(A \to aBc\), no long rules like \(A \to BCDE\), no \(\varepsilon\)-rules (except for the start variable).

Why CNF matters

The rigid rule shape buys algorithmic leverage: every derivation of a string of length \(n \geq 1\) takes exactly \(2n - 1\) steps (\(n - 1\) applications of \(A \to BC\) plus \(n\) applications of \(A \to a\)), parse trees are binary, membership can be decided by dynamic programming (the CYK algorithm), and many proofs about CFLs are cleaner when the grammar is assumed to be in CNF.

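The binary rule shape is exactly what the CYK membership algorithm exploits: a table cell per substring, filled bottom-up by combining pairs. A sketch (my own code and my own CNF grammar for \(\{0^n1^n \mid n \geq 1\}\): \(S \to ZB \mid ZO\), \(B \to SO\), \(Z \to 0\), \(O \to 1\)):

```python
def cyk(rules, start, w):
    """CYK membership test for a grammar in Chomsky normal form.
    rules: dict variable -> set of RHSs, each a 1-tuple (terminal,)
    or a 2-tuple (B, C) of variables."""
    n = len(w)
    if n == 0:
        return () in rules.get(start, set())      # only S -> ε can derive ε
    # T[i][j] holds the variables that derive the substring w[i : i+j+1].
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, c in enumerate(w):                     # length-1 substrings: A -> a
        T[i][0] = {A for A, rhss in rules.items() if (c,) in rhss}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):        # w[i:i+split] | w[i+split:i+length]
                for A, rhss in rules.items():
                    for rhs in rhss:
                        if (len(rhs) == 2 and rhs[0] in T[i][split - 1]
                                and rhs[1] in T[i + split][length - split - 1]):
                            T[i][length - 1].add(A)
    return start in T[0][n - 1]

rules = {"S": {("Z", "B"), ("Z", "O")}, "B": {("S", "O")},
         "Z": {("0",)}, "O": {("1",)}}
print(cyk(rules, "S", "000111"), cyk(rules, "S", "0011"), cyk(rules, "S", "0101"))
# → True True False
```

The triple loop is \(O(n^3)\) in the string length — polynomial membership testing, which the unrestricted rule shapes of a general CFG don't directly give you.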
Conversion to CNF (sketch)

Theorem (Sipser 2.9). Any context-free language is generated by a context-free grammar in Chomsky normal form.

The conversion proceeds in four stages:

  1. Add a new start variable. Create \(S_0 \to S\). This ensures the start variable doesn't appear on the right-hand side of any rule.
  2. Remove \(\varepsilon\)-rules. For each rule \(A \to \varepsilon\) (where \(A\) is not the start variable), find every occurrence of \(A\) on a right-hand side and add a version of that rule with the occurrence deleted. E.g., if \(R \to uAv\) exists, add \(R \to uv\); if \(A\) occurs several times, add a rule for every combination of deletions (\(R \to uAvAw\) yields \(R \to uvAw\), \(R \to uAvw\), and \(R \to uvw\)). Repeat until no \(\varepsilon\)-rules remain (except possibly \(S_0 \to \varepsilon\)).
  3. Remove unit rules. A unit rule is \(A \to B\) (one variable to one variable). If \(A \to B\) and \(B \to u\), add \(A \to u\) (unless it's a unit rule we already removed). Repeat until no unit rules remain.
  4. Convert remaining rules to proper form. For rules with right-hand sides of length \(\geq 3\), like \(A \to u_1 u_2 \cdots u_k\), introduce new variables to chain them into pairs: \(A \to u_1 A_1\), \(A_1 \to u_2 A_2\), ..., \(A_{k-2} \to u_{k-1} u_k\). For any terminal \(a\) that appears in a rule alongside variables, replace it with a fresh variable \(U_a\) and add \(U_a \to a\).
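Stage 4's chaining of a long right-hand side into binary rules can be sketched in a few lines (function and variable names are my own):

```python
from itertools import count

def to_pairs(lhs, rhs, fresh):
    """Stage 4 sketch: chain A -> u1 u2 ... uk (k >= 3) into binary rules
    A -> u1 A1, A1 -> u2 A2, ..., A_{k-2} -> u_{k-1} u_k."""
    rules, current = [], lhs
    for i in range(len(rhs) - 2):
        nxt = next(fresh)                     # fresh intermediate variable
        rules.append((current, (rhs[i], nxt)))
        current = nxt
    rules.append((current, (rhs[-2], rhs[-1])))
    return rules

fresh = (f"A{i}" for i in count(1))
print(to_pairs("A", ["B", "C", "D", "E"], fresh))
# → [('A', ('B', 'A1')), ('A1', ('C', 'A2')), ('A2', ('D', 'E'))]
```

Every produced rule has exactly two symbols on the right, as CNF requires; terminals mixed into `rhs` would additionally be pulled out into their own \(U_a \to a\) rules.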
Worked example (Sipser 2.10). Convert this grammar to CNF:

\(S \to ASA \mid aB\),   \(A \to B \mid S\),   \(B \to b \mid \varepsilon\)

Step 1 — New start: add \(S_0 \to S\).
Step 2 — Remove \(B \to \varepsilon\): add versions of rules using \(B\) with \(B\) deleted. \(S \to aB\) gives \(S \to a\). \(A \to B\) gives \(A \to \varepsilon\). Now remove \(A \to \varepsilon\): \(S \to ASA\) gives \(S \to SA \mid AS \mid S\). Continue until all \(\varepsilon\)-rules eliminated.
Step 3 — Remove unit rules: \(S_0 \to S\), \(A \to B\), \(A \to S\), \(S \to S\). Replace each with the non-unit rules they lead to.
Step 4 — Fix long rules and mixed rules. Final result:

\(S_0 \to AA_1 \mid UB \mid a \mid SA \mid AS\)
\(S \to AA_1 \mid UB \mid a \mid SA \mid AS\)
\(A \to b \mid AA_1 \mid UB \mid a \mid SA \mid AS\)
\(A_1 \to SA\),   \(U \to a\),   \(B \to b\)
Worked Examples

Example 1: \(\{0^n1^n \mid n \geq 0\}\)

Grammar: \(S \to 0S1 \mid \varepsilon\)

This is the canonical example of a CFL that is not regular. The recursive rule \(S \to 0S1\) wraps matched pairs — it's the grammatical equivalent of push/pop on a stack. Each application adds one 0 to the left and one 1 to the right, maintaining the count invariant. The \(\varepsilon\)-rule is the base case.

Sample derivations:

S ⇒ ε    (n = 0)
S ⇒ 0S1 ⇒ 01    (n = 1)
S ⇒ 0S1 ⇒ 00S11 ⇒ 0011    (n = 2)

Example 2: Balanced parentheses

Grammar: \(S \to (S) \mid SS \mid \varepsilon\)

This is Sipser's grammar \(G_3\) with ( instead of a and ) instead of b. The three rules handle: wrapping a balanced string in a matching pair (\(S \to (S)\)), placing two balanced strings side by side (\(S \to SS\)), and the empty string as the base case (\(S \to \varepsilon\)).

Derivation of (()()):

S ⇒ (S) ⇒ (SS) ⇒ ((S)S) ⇒ (()S) ⇒ (()(S)) ⇒ (()())
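Recognizing this language needs only a counter, which is the degenerate "stack" the grammar's push/pop structure mimics. A minimal sketch (my own code):

```python
def balanced(s):
    """Check balanced parentheses: depth tracks open-minus-closed parens.
    Rejects if depth ever goes negative or doesn't return to zero."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:            # a ')' with no matching '(' before it
            return False
    return depth == 0

print([balanced(x) for x in ["(()())", "(()", ")("]])  # → [True, False, False]
```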

Example 3: Arithmetic expressions

Grammar \(G_4\) (Sipser Example 2.4):

\[\langle\text{EXPR}\rangle \to \langle\text{EXPR}\rangle + \langle\text{TERM}\rangle \mid \langle\text{TERM}\rangle\] \[\langle\text{TERM}\rangle \to \langle\text{TERM}\rangle \times \langle\text{FACTOR}\rangle \mid \langle\text{FACTOR}\rangle\] \[\langle\text{FACTOR}\rangle \to (\langle\text{EXPR}\rangle) \mid \texttt{a}\]

This grammar captures operator precedence: \(\times\) binds tighter than \(+\) because \(\times\) is handled at the \(\text{TERM}\) level (closer to the leaves) while \(+\) is at the \(\text{EXPR}\) level (closer to the root).

Derivation of a+a×a:

EXPR ⇒ EXPR + TERM ⇒ TERM + TERM ⇒ FACTOR + TERM
     ⇒ a + TERM ⇒ a + TERM × FACTOR ⇒ a + FACTOR × FACTOR
     ⇒ a + a × FACTOR ⇒ a + a × a

This derivation produces a single parse tree that groups as \(a + (a \times a)\) — correct precedence.


Exercises

Exercise 1: Design a CFG

Give a context-free grammar for the language \(L = \{a^i b^j \mid i \neq j,\ i, j \geq 0\}\) over \(\Sigma = \{a, b\}\).

Hint: Split into two cases — more \(a\)'s than \(b\)'s, or more \(b\)'s than \(a\)'s. Use the union technique: design separate grammars and combine with \(S \to S_1 \mid S_2\).

Exercise 2: Identify the Language

What language does this grammar generate?

\(S \to aSa \mid bSb \mid a \mid b \mid \varepsilon\)

Exercise 3: Parse Tree from Rules

Given grammar \(G = (\{S, A, B\}, \{a, b\}, P, S)\) with rules:

\(S \to aA \mid b\),    \(A \to a \mid bB\),    \(B \to aA \mid b\)

Give a leftmost derivation for the string \(abababb\), and state how many rule applications are used.

Summary
Context-Free Grammars — the essentials:
  1. A CFG is a 4-tuple \(G = (V, \Sigma, R, S)\): variables, terminals, rules of the form \(A \to w\) with \(w \in (V \cup \Sigma)^*\), and a start variable.
  2. A derivation rewrites one variable per step; \(L(G)\) is the set of terminal strings derivable from \(S\). A language is context-free if some CFG generates it.
  3. Regular languages are strictly contained in the context-free languages: \(\{0^n1^n\}\) is context-free but not regular, while \(\{a^nb^nc^n\}\) is not even context-free.
  4. Parse trees capture structure; a grammar is ambiguous if some string has two parse trees (equivalently, two leftmost derivations). A precedence hierarchy (EXPR/TERM/FACTOR) removes ambiguity for arithmetic expressions.
  5. Every CFG can be converted to Chomsky normal form (\(A \to BC\) or \(A \to a\)), which makes grammars tractable for algorithms.