Lexical analysis

Introduction
Formal languages
Regular expressions
Lexical grammar
Finite automata
References

Introduction

Lexical analysis (lexing) is the process of converting a sequence of characters into a sequence of tokens. A program that performs lexing is known as a lexer. Lexical analysis is normally the first phase of a compiler.

A token consists of a token name and an optional value, e.g. <identifier, x> [1, P. 111].

A lexeme is a sequence of characters in the input source that matches the pattern for a token and is identified as an instance of that token [1, P. 111].

A pattern is a description of the form that lexemes can take.

Common token names include:

Keywords
Operators
Identifiers
Literals
Separators

Usually a separate symbol table is maintained by the lexer which contains additional information about identifiers (such as type, line location, etc.) [1, P. 112].

Formal languages

A formal language is a set of strings drawn from an alphabet (usually represented as $Σ$ ), which is a finite set of characters [1, P. 117].

A meaning function $L$ maps syntax to semantics. The $L$ mapping is many-to-one, which enables optimizations without changing the semantics of the language.

The empty string (denoted $ϵ$ ) is a string of length 0.

Regular expressions

A regular expression is a sequence of characters that defines a search pattern. Regular expressions can be used for specifying lexeme patterns [1, P. 116].

Regular expressions consist of constants, which denote sets of strings, and operator symbols, which denote operations.

Common operations include:

Alternation ( $A ∣ B = {c ∣ c \in A \lor c \in B}$ )
Concatenation ( $A B = {c d ∣ c \in A \land d \in B}$ )
Kleene closure ( $A^{*} = \cup_{i = 0}^{\infty} A^{i}$ )
Positive closure( $A^{+} = \cup_{i = 1}^{\infty} A^{i}$ )

A regular set is a language that can be defined by a regular expression [1, P. 122].

Lexical grammar

A lexical grammar is a set of rules defining the syntax of tokens.

In the case that a substring matches multiple patterns, the following conventions are followed:

Maximal match: if there are two possible substrings, take the longer one.
Rule priority: Regular expressions identifying tokens are written down in sequence. If two regular expressions match the longest string, the first regular expression in the sequence takes precedence.

Finite automata

Finite automata (or finite state machines) are abstract machines that can be in exactly one state at a time. They can be used to determine matches for a regular expression.

Finite automata can be represented as diagrams of nodes, where each node represents a state.

Figure: Finite automata that accepts 1 or 70

They can also be represented with a transition table, where the rows correspond to a state and the columns correspond to the input symbols and their associated next state [1, P. 148].

A transition between two states is known as a move.

The input pointer only advances in finite automata, although in an epsilon move a transition can occur without advancing the input pointer.

A finite automata can be defined to accept a string $s$ iff there is a path in the transition graph from the start state to a final state so that the symbols along the path spell out $s$ [1, P. 149].

There are two types of finite automata: nondeterministic and deterministic.

NFAs (Nondeterministic Finite Automata) have no restrictions on their edges. A symbol label can have several edges, and the empty string epsilon is a supported edge label [1, P. 147].

NFAs consists of:

A finite set of states $S$ .
An input alphabet $Σ$ .
A transition function that gives for each state and each symbol in $Σ \cup ϵ$ a set of next states.
A start state $s_{0}$ in $S$ .
A set of states $F$ in $S$ which are final states.

[1, P. 147]

A DFA (Deterministic Finite Automata) is a special case of an NFA with the following conditions:

There are no moves for an empty string input.
There is only one edge per input symbol for a given state .

[1, P. 151]

A regex pattern can be converted to an NFA. An NFA can then be converted to a DFA, which is simpler to implement and computationally less expensive to run (although it often requires more space) [1, P. 150].

References

[1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools (2nd Edition). USA: Addison-Wesley Longman Publishing Co., Inc., 2006.