01 — Alphabets, Strings & Languages
The three building blocks everything else is made from
Why This Matters

Before you can talk about machines that recognize patterns, you need a precise vocabulary for what they're recognizing. In software engineering terms: you wouldn't write a parser without defining your token types first. Alphabets, strings, and languages are the "type system" of computation theory — every automaton, every regex, every grammar operates on these three primitives.

Get these definitions locked in and the rest of the course is just combining them in increasingly clever ways. Skip them and everything downstream is built on vague intuition.

Alphabet (\(\Sigma\))
Definition. An alphabet \(\Sigma\) is any finite, non-empty set of symbols.

That's it. A finite set. The symbols are atomic — they have no internal structure as far as the theory cares.

Examples:

Key point: \(\Sigma\) must be finite. You can't have an alphabet with infinitely many symbols. But there's no upper limit on how many — just has to be a fixed, finite set.
Strings (Words)
Definition. A string (or word) over \(\Sigma\) is a finite sequence of symbols from \(\Sigma\).

Think of a string as an array of characters. The terms "string" and "word" are used interchangeably in automata theory.

Over \(\Sigma = \{0, 1\}\):

The Empty String \(\varepsilon\)

Definition. \(\varepsilon\) (epsilon) is the unique string of length 0. It contains no symbols. For any string \(w\): \(w\varepsilon = \varepsilon w = w\).

Software analogy: \(\varepsilon\) is like the empty string "" in programming. Concatenating it changes nothing. It's the identity element for string concatenation, just like 0 is the identity for addition.

Warning: \(\varepsilon\) is not a symbol. It's never in \(\Sigma\). It's a string — specifically the string with nothing in it. Don't confuse \(\varepsilon\) (empty string) with \(\emptyset\) (empty set/language). They are completely different objects.

String Length \(|w|\)

Definition. The length of a string \(w\), written \(|w|\), is the number of symbols in it.
String Operations

Concatenation

If \(x = \text{abc}\) and \(y = \text{de}\), then \(xy = \text{abcde}\). Just stick them together. Concatenation is associative (\((xy)z = x(yz)\)) but not commutative (\(xy \neq yx\) in general).

\(w^k\) means \(w\) concatenated with itself \(k\) times. So \(w^0 = \varepsilon\), \(w^1 = w\), \(w^2 = ww\), etc.

Reversal \(w^R\)

Definition. The reversal of \(w = w_1 w_2 \cdots w_n\) is \(w^R = w_n w_{n-1} \cdots w_1\).
\(\Sigma^*\) and \(\Sigma^n\)
Definition. \(\Sigma^n\) is the set of all strings over \(\Sigma\) of length exactly \(n\).

\(\Sigma^* = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \cdots = \bigcup_{n=0}^{\infty} \Sigma^n\)

\(\Sigma^*\) is the set of all strings over \(\Sigma\), including \(\varepsilon\).

For \(\Sigma = \{0, 1\}\):

Pattern: \(|\Sigma^n| = |\Sigma|^n\). For binary, \(|\Sigma^n| = 2^n\). This is just the number of ways to fill \(n\) slots with \(|\Sigma|\) choices each.

\(\Sigma^*\) is countably infinite — you can list all strings in shortlex order (by length, then lexicographic within each length). This is important later when we show that some languages can't be recognized by any machine.

Also useful: \(\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}\) — all non-empty strings.

Language
Definition. A language \(L\) over \(\Sigma\) is any subset of \(\Sigma^*\). That is, \(L \subseteq \Sigma^*\).

This is the most important definition on this page. A language is just a set of strings. Any set. It can be finite, infinite, empty, or all of \(\Sigma^*\).

Software analogy: a language is like a validation function isValid(s: string): boolean. The language is the set of all strings for which the function returns true. Automata theory asks: what kinds of validation functions can different machine models implement?

Example Languages over \(\Sigma = \{0, 1\}\)

LanguageDescriptionExample members
\(\emptyset\)The empty language — no strings at all(none)
\(\{\varepsilon\}\)Language containing only the empty string\(\varepsilon\)
\(\{0^n 1^n \mid n \geq 0\}\)Equal-length blocks of 0s then 1s\(\varepsilon, 01, 0011, 000111\)
Strings with even # of 1sEvery string where \(\#_1(w)\) is even\(\varepsilon, 0, 00, 11, 010, 0110\)
\(\Sigma^*\)All binary strings (everything is accepted)everything
Crucial distinction:
\(\emptyset \neq \{\varepsilon\}\). The empty language has zero strings in it. The language \(\{\varepsilon\}\) has one string in it (the empty string). It's like the difference between an empty array [] and an array containing an empty string [""].

Practice: Enumerate \(\Sigma^n\)

Pick an alphabet and a length. See all strings of that length.

   

Practice: Language Membership

Enter a binary string. Check if it belongs to common languages over \(\{0,1\}\).

Summary
The three primitives:
  1. Alphabet \(\Sigma\) — finite set of symbols
  2. String — finite sequence of symbols from \(\Sigma\); \(\varepsilon\) = empty string
  3. Language — any subset of \(\Sigma^*\) (the set of all strings)

Everything in this course — DFAs, NFAs, regular expressions, context-free grammars, Turing machines — is a different way of describing or recognizing languages.