# Lexical analysis

## Table of contents

## Introduction

Lexical analysis (lexing) is the process of converting a sequence of characters into a sequence of tokens. A program that performs lexing is known as a lexer. Lexical analysis is normally the first phase of a compiler.

A **token** consists of a token name and an optional value, e.g. <identifier, x> [1, P. 111].

A **lexeme** is a sequence of characters in the input source that matches the pattern for a token and is identified as an instance of that token [1, P. 111].

A **pattern** is a description of the form that lexemes can take.

Common token names include:

- Keywords
- Operators
- Identifiers
- Literals
- Separators

Usually a separate symbol table is maintained by the lexer which contains additional information about identifiers (such as type, line location, etc.) [1, P. 112].

## Formal languages

A formal language is a set of strings drawn from an **alphabet** (usually represented as ), which is a finite set of characters [1, P. 117].

A meaning function maps syntax to semantics. The mapping is many-to-one, which enables optimizations without changing the semantics of the language.

The empty string (denoted ) is a string of length 0.

## Regular expressions

A regular expression is a sequence of characters that defines a search pattern. Regular expressions can be used for specifying lexeme patterns [1, P. 116].

Regular expressions consist of constants, which denote sets of strings, and operator symbols, which denote operations.

Common operations include:

- Alternation ()
- Concatenation ()
- Kleene closure ()
- Positive closure()

A **regular set** is a language that can be defined by a regular expression [1, P. 122].

## Lexical grammar

A **lexical grammar** is a set of rules defining the syntax of tokens.

In the case that a substring matches multiple patterns, the following conventions are followed:

- Maximal match: if there are two possible substrings, take the longer one.
- Rule priority: Regular expressions identifying tokens are written down in sequence. If two regular expressions match the longest string, the first regular expression in the sequence takes precedence.

## Finite automata

Finite automata (or finite state machines) are abstract machines that can be in exactly one state at a time. They can be used to determine matches for a regular expression.

Finite automata can be represented as diagrams of nodes, where each node represents a state.

They can also be represented with a transition table, where the rows correspond to a state and the columns correspond to the input symbols and their associated next state [1, P. 148].

A transition between two states is known as a move.

The input pointer only advances in finite automata, although in an epsilon move a transition can occur without advancing the input pointer.

A finite automata can be defined to accept a string iff there is a path in the transition graph from the start state to a final state so that the symbols along the path spell out [1, P. 149].

There are two types of finite automata: nondeterministic and deterministic.

NFAs (Nondeterministic Finite Automata) have no restrictions on their edges. A symbol label can have several edges, and the empty string epsilon is a supported edge label [1, P. 147].

NFAs consists of:

- A finite set of states .
- An input alphabet .
- A transition function that gives for each state and each symbol in a set of next states.
- A start state in .
- A set of states in which are final states.

A DFA (Deterministic Finite Automata) is a special case of an NFA with the following conditions:

- There are no moves for an empty string input.
- There is only one edge per input symbol for a given state .

A regex pattern can be converted to an NFA. An NFA can then be converted to a DFA, which is simpler to implement and computationally less expensive to run (although it often requires more space) [1, P. 150].

## References

- [1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman,
*Compilers: Principles, Techniques, and Tools (2nd Edition)*. USA: Addison-Wesley Longman Publishing Co., Inc., 2006.