Parsing

Introduction
Grammars
Parse tree
1. AST
First and follow sets
1. First sets
2. Follow sets
Top-down parsing
1. Recursive descent parsing
2. Predictive parsing
Bottom-up parsing
1. Shift-reduce parsing
  1. LR(0) parser
  2. SLR parser
References

Introduction

Parsing is the process of verifying that a sequence of tokens conforms to the rules of a formal grammar. It also involves constructing a (sometimes implicit) parse tree. A program that performs parsing is known as a parser.

Note: Some compilers combine lexing and parsing into a single phase.

There are three general types of parser covered in these notes:

Top-down
Bottom-up
Universal

Top-down parsers build parse trees from the root to the leaves, whereas bottom-up parsers build parse trees by starting at the leaves and ending at the root [1, P. 192].

Grammars

In the context of parsing, a grammar is a set of productions that describes how to form valid strings according to a formal language’s syntax.

Note: grammars can’t describe all aspects of a language, e.g. a context-free grammar can’t define that an identifier must be defined before it’s used [1, P. 209].

Terminals are the lexical elements used in specifying the rules for a grammar [1, P. 197].

Nonterminals are syntactic variables that denote sets of strings [1, P. 197].

A production is a rewrite rule specifying a symbol substitution that can be made.

A production head is a number of symbols (with at least one nonterminal). A production body is a sequence of terminals/nonterminals which can replace head.

A CFG (Context-Free Grammar) is a formal grammar where productions can be applied regardless of the context of a nonterminal. In other words, in a CFG a production head can only contain a single nonterminal.

Formally, a CFG $G$ consists of:

A finite set of nonterminals $N$ .
A finite set of terminals $T$ .
A start symbol $S$ ( $S \in N$ ).
A finite set of productions $P$ , where each production is in the form $X \rightarrow Y$ and $X \in N$ and $Y_i \in N \cup T \cup \{ \epsilon \}$ .

Every construct that can be described by a regular expression can be described by a CFG, but the inverse is not true [1, P. 205].

A language that can be generated by a CFG is a context-free language [1, P. 200].

If two grammars generate the same language then the grammars are said to be equivalent [1, P. 200].

In these notes, productions are written in the following form:

$\text{A} \rightarrow \text{if} \ ( \ \mathit{expr} \ ) \ \mathit{stmt} \ \text{else} \ \mathit{stmt}$

Other productions would then define the nonterminals $expr$ and $stmt$ , which are italicized.

A set of productions with a common head can be written using the vertical bar (|):

$A \rightarrow B \ \vert \ C \ \vert \ D$

Derivation

A derivation of a string for a grammar is a sequence of rule applications (productions) that transform the start symbol into a string.

A derivation can be represented as a tree (a parse tree).

A derivation proves that a string is an instance of an expression [1, P. 200].

The symbol $\overset{*} \Rightarrow$ means “derives in zero or more steps”.

The symbol $\overset{+} \Rightarrow$ means “derives in one or more steps”.

A sentential form is a string that is derivable from the start symbol. Formally, if $S \overset{*} \Rightarrow \alpha$ , where $S$ is the start symbol for a grammar $G$ , then $\alpha$ is a sentential form of $G$ .

A sentence of $G$ is a sentential form with no nonterminals. The language generated by a grammar is its set of sentences [1, P. 200].

In leftmost derivations, the leftmost nonterminal expression in each sentential is chosen (denoted $\alpha \overset{*}{ \underset{\text{lm}} \Rightarrow} \beta$ ).

In rightmost derivations (canonical derivations), the rightmost nonterminal is chosen (denoted $\alpha \overset{*}{ \underset{\text{rm}} \Rightarrow} \beta$ ).

At each step in a derivation, there are two decisions to be made:

Which nonterminal should be replaced.
Which production should replace it.

[1, P. 200]

Left recursion

A grammar is left-recursive if it has a nonterminal $A$ such that there is a derivation $A \overset{+} \Rightarrow A \alpha$ for some string $\alpha$ [1, P. 212].

Immediate left recursion is where there is a production $A \Rightarrow A \alpha$ .

Top-down parsing cannot handle left-recursive grammars. In this case the recursion must be eliminated with the rule:

$A \rightarrow A \alpha \ \vert \ \beta$ becomes $$

[1, P. 212]

Left factoring

Left factoring is the process of rewriting productions that share a common prefix, it can be done to make a grammar suitable to predictive parsing [1, P. 214].

For example, $A \rightarrow \alpha \beta_1 \ \vert \ \alpha \beta_2$ becomes:

$A' => \beta_1 \ \vert \ \beta_2$

[1, P. 214]

Ambiguity

An ambiguous grammar produces more than one parse tree for a given sentence [1, P. 202].

Ambiguous grammars are bad since they leave the meaning of the program undefined.

One way to remove ambiguity is to rewrite grammar unambiguously. Another approach is to use disambiguating declarations. The most popular forms are associativity declarations and precedence declarations, e.g. to declare left associativity in bison: %left +.

LL grammar

An LL(k) grammar is a grammar that can be parsed by an LL parser with k tokens of lookahead. e.g. LL(1) grammar can be parsed by an LL parser with 1 token lookahead.

An LL parser is a top-down parser that parses from left to right, constructing a left-most derivation.

A grammar is guaranteed not to be LL(1) if it is any of:

Not left factored
Left recursive
Ambiguous

Even if a grammar does not meet one of these criteria, it is still not guaranteed to be LL(1).

LR grammar

An LR(k) grammar is a grammar that can be parsed by an LR parser with k tokens of lookahead.

An LR parser is a top-down parser that parses from right to left, constructing a right-most derivation.

BNF

BNF (Backus-Naur Form) is a syntax for context-free grammars.

Example:

<addition> ::= <number>+<number>
<number> ::= <sign><integer>|<integer>
<integer> ::= <digit>|<digit><integer>
<digit>::=0|1|2|3|4|5|6|7|8|9
<sign> ::= +|-

Parse tree

A parse tree is a graphical representation of a derivation in which each non-leaf node represents the application of a production [1, P. 201].

The leaves of a parse tree are labelled by nonterminals or terminals [1, P. 201].

Parse trees ignore the order in which symbols in sentential form are replaced [1, P. 202].

AST

An AST (Abstract Syntax Tree) is a tree representation of the syntactic structure of a program.

ASTs are similar to parse trees except that an AST’s interior nodes represent programming constructs, whereas a parse tree’s interior nodes represent nonterminals [1, P. 69].

Parse trees are sometimes called concrete syntax trees to distinguish them from ASTs.

An AST is a key data structure in a compiler.

First and follow sets

First and follow sets are used to help construct top-down and bottom-up parsers.

First sets

To compute $First(X)$ for all grammar symbols $X$ :

If $X$ is terminal, $First(X) = \{X\}$ .
If $X$ is nonterminal and $X \rightarrow Y_1, Y_2, ..., Y_k$ is a production for $k >= 1$ , add $a$ to $First(X)$ if for some $i$ , $a \in First(Y_i)$ and $First(Y_1) ... First(Y_{i-1}) \overset{*} \Rightarrow \epsilon$ . If $\epsilon \in First(Y_j)$ for $j = 1, 2,... , k$ then add $\epsilon$ to $First(X)$ .
If $X \rightarrow \epsilon$ is a production then add $\epsilon$ to $First(X)$ .

[1, P. 221]

Follow sets

$Follow(A)$ for a nonterminal $A$ is the set of terminals that can appear immediately to the right of $A$ in some sentential form.

To compute $Follow(A)$ for all nonterminals $A$ :

Place $ in $Follow(S)$ (where $S$ is start symbol and $ is the input right end marker).
If there is a production $A \rightarrow \alpha B \beta$ , add everything in $First(\beta)$ to $Follow(B)$ (except $\epsilon$ ).
If there is a production $A \rightarrow \alpha B$ , or a production $A \rightarrow \alpha B \beta$ where $\epsilon \in First(\beta)$ , add everything in $Follow(A)$ to $Follow(B)$ .

[1, Pp. 221-2]

Top-down parsing

Top-down parsing is the process of producing a parse tree from an input sequence starting from the root by following the rewriting rules of a grammar [1, P. 217].

At each step of a top-down algorithm, you must determine the production to be applied for a nonterminal. Once a production has been chosen, the rest of the parsing involves matching the production body with the input string [1, P. 217].

Recursive descent parsing

A recursive descent parser is a top-down LL parser.

A recursive descent parser implements a procedure for each nonterminal. See Recursive descent parser Wiki article for full details.

A left-recursive grammar can cause a recursive descent parser to enter an infinite loop [1, P. 220].

Predictive parsing

Predictive parsing is a special case of recursive descent parsing that doesn’t require backtracking [1, P. 222].

Predictive parsers work on LL(1) grammars since the proper production to apply for a nonterminal can be determined by looking at only the current input symbol [1, P. 222].

Predictive parsing can be implemented using a predictive parsing table. A predictive parsing table $T[A, \alpha]$ is a 2D array where $A$ is a nonterminal, $\alpha$ is a terminal (or the special end symbol $), and an entry is a production.

An LL(1) grammar will produce a parsing table where each entry uniquely identifies a production [1, P. 225].

A table can be constructed for CFG $G$ with the following rules:

For each production in :
- For each terminal $a \in First(\alpha)$ : $T[A, a] = A \rightarrow \alpha$
- If $epsilon \in First(\alpha)$ , for each terminal $b \in Follow(A)$ : $T[A, b] = A \rightarrow \alpha$ . If $\epsilon \in First(\alpha)$ and $\$ \in Follow(A)$ do: $T[A, \$] = A \rightarrow \alpha$

[1, P. 224]

The following pseudocode implements a table-driven predictive parser:

let a be the first symbol of w;
let X be the top of the stack symbol
while X != $:
  if X == a:
    pop the stack and let a be the next symbol of w
  else if X is a terminal:
    error()
  else if T[X, a] is an error entry:
    error()
  else if T[X, a] = X -> Y_1, Y_2, ..., Y_n:
    output production X -> Y_1, Y_2, ..., Y_n
    pop the stack
    push Y_n, ..., Y_2, Y_1 onto the stack with Y_1 on top
  let X be the top stack symbol

[1, P. 227]

Bottom-up parsing

Bottom-up parsing reduces a string to the start symbol by inverting productions (begin with input token sequence, end with start symbol).

Most parser generator tools use bottom-up parsing.

Shift-reduce parsing

Shift-reduce parsing is a form of bottom-up parsing.

In shift-reduce parsing, a string is split into two substrings. The left substring contains terminals and nonterminals, and the right substring is unexamined. The division can be represented by a bar (|).

There are two primary moves: shift and reduce. Shift reads one token of input. Reduce applies a reduction to the right end of the left string.

A reduction is the inverse of a derivation step. It replaces input sequences with productions. A complete reduction sequence produces a rightmost derivation when reversed [1, P. 235].

A shift-reduce parser is normally implemented with a stack, where shift pushes a terminal onto the stack and reduce pops symbols from the stack and pushes a nonterminal onto the stack.

A handle is a substring that matches the body of a production, and whose reduction allows further reductions back to the start symbol. Handles only appear on the top of the stack, never inside [1, Pp. 235-7].

An item is a production with a dot (.) somewhere in the production body that indicates how much of a production has been seen at given point in the parsing process [1, P. 242].

The items (also known as LR(0) items) for $T \rightarrow (E)$ are:

$T \rightarrow .(E)$
$T \rightarrow (.E)$
$T \rightarrow (E.)$
$T \rightarrow (E).$

The production $A \rightarrow \epsilon$ generates an item $A \rightarrow .$ [1, P. 242].

A viable prefix is a prefix of a right-sentential form that does not continue past the right end of the rightmost handle of that sentential form. As long as a viable prefix is on the stack, no parsing error has been detected [1, P. 256].

For any grammar, the set of viable prefixes is a regular language and therefore can be validated by a finite automata. You can build an NFA that takes the stack as its input and either accepts or rejects the stack. The NFA can then be converted to a DFA using the powerset construction.

The rules to construct an NFA for recognizing viable prefixes VPs:

Add a dummy production $S' \rightarrow S$ to $G$ .
NFA states are the items of $G$ (including the extra production).
For item $E \rightarrow \alpha . X \beta$ , add transition $E \rightarrow alpha . X \beta \overset{X} \rightarrow E \rightarrow \alpha X . \beta$ ( $X$ is nonterminal or terminal).
For item $E \rightarrow \alpha . X \beta$ and production $X \rightarrow \gamma$ ( $X$ is nonterminal) add transition $E \rightarrow \alpha . X \beta \overset{\epsilon} \rightarrow X \rightarrow . \gamma$
Every state is an accepting state.
Start state is $S' \rightarrow .S$ .

When the NFA has been converted into a DFA, each state is a set of items representing the possible current state of the automaton. These states are known as the canonical collections of items.

Item $X \rightarrow \beta . \gamma$ is valid for viable prefix $\alpha \beta$ if $S' \overset{*} \Rightarrow \alpha X \omega \Rightarrow \alpha \beta \gamma \omega$ .

A conflict occurs if there are multiple actions leading to a valid parse at a given step. A shift-reduce conflict is when a grammar allows both a shift and a reduce for the same item. A reduce-reduce conflict is when there are two or more possible reductions.

LR(0) parser

An LR(0) parser is a shift-reduce parser that uses zero tokens of lookahead to determine what action to take.

Let $s$ be the terminating state when running $DFA(stack)$ and $t$ be the next input token.

At each stage, an LR(0) parser will reduce by $X \rightarrow \beta$ if $s$ contains item $X \rightarrow \beta .$ . An LR(0) parser will reduce if $s$ contains item $X \rightarrow \beta. t \omega$ .

Since there is zero lookahead, in any configuration of the parser there must be an unambiguous action that it can take.

SLR parser

An SLR parser is a type of LR parser.

An SLR parser improves on LR(0) with the modification that a reduction $X \rightarrow \beta$ is only made if $s$ contains $X \rightarrow \beta.$ and $t \in Follow(X)$ .

SLR parsing algorithm:

Let $M$ be DFA for viable prefixes of $G$ .
Let $\vert x_1, ..., x_n \$$ be the initial configuration.
Repeat until configuration is
1. Let $\alpha \vert \omega$ be current configuration.
2. Run $M$ on current stack $\alpha$ .
3. If $M$ rejects $\alpha$ , report parsing error.
4. If accepts with items , let be next input.
  1. Shift if $X \rightarrow \beta . a \gamma \in I$ .
  2. Reduce if $X \rightarrow \beta . \in I$ and $a \in Follow(X)$ .
  3. Report parsing error if neither applies.

In the case of conflicts when following this algorithm, the grammar is not an SLR grammar.

References

[1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools (2nd Edition). USA: Addison-Wesley Longman Publishing Co., Inc., 2006.