Semantic analysis

Introduction
Scope
Symbol tables
Types
References

Introduction

Semantic analysis is a compiler process which validates that source code is semantically consistent with the language definition. An example is type checking. It also often includes gathering additional information for future phases (e.g. type information) [1, P. 8].

Examples of validations made during semantic analysis:

All identifiers are declared before usage
Type checking
Inheritance relationships are valid
Methods in a class are defined only once

Semantic analysis is the last phase of the compiler frontend.

Most semantic analyses can be implemented as a recursive descent of an AST.

Scope

The scope of a name binding is the part of a program where the name binding is valid (where the name can be used to refer to the entity). The same name may refer to different entities in different parts of the program.

Name bindings can have restricted scope, e.g. in C, where block scope restricts scope to a subset of a function.

Lexical scope (aka static scope) is where the scope only depends on the position of the identifier in the source text—the scope isn’t based on run-time behavior. Most programming languages use static scope.

Dynamic scope is where the scope of an identifier depends on the execution of a program (e.g. the closest binding in the execution of the language). Lisp used to be dynamically scoped.

The “most closely nested” rule is where an identifier refers to the definition in the closest enclosing scope, such that the declaration precedes the use. C++ uses the “most closely nested rule”.

Symbol tables

A symbol table is a data structure that tracks the current bindings of identifiers.

When performing semantic analysis on a portion of the AST, the defined identifiers must be known.

Types

A type is an attribute of data that defines the operations that can be performed on the data, the values that the data can take, and the way the data is stored.

The three main benefits of types in a compiler:

Safety
Expressiveness
Run-time efficiency

[2, P. 165]

Statically-typed languages are typechecked during compilation (e.g. C, Java). Dynamically-typed languages are typechecked at run-time (e.g. JS, Python).

Most statically-typed languages have escape mechanisms to circumvent the type system, like unsafe casts in C and Java.

Implicit type conversion is where a value of type $T$ is coerced into an expected type $E$ when $T$ is an invalid type for the operation being performed on it. A strongly-typed language typically doesn’t perform implicit type conversions, whereas a weakly-typed language does perform implicit type conversions. e.g. '1' + 1 throws an error in strongly-typed Python, and evaluates to '11' in weakly-typed JS.

A type signature defines the types of the parameters and the return value of a function or method.

Type inference is where the compiler automatically detects the type of an expression. For example, a variable could be declared without a type annotation and the compiler could infer the type at compile-time (e.g. var in C#).

A sound type system has the property that if a variable is declared with a particular type, then it will have that type at run-time. A sound type system has the ability to catch every possible bug that might happen at run-time.

A complete type system has the property that it will only ever catch bugs that will happen at run-time. This comes at the cost of sometimes missing errors that will happen at run-time.

Type checking

Type checking is the process of verifying and enforcing type constraints. Static type checking is done at compile-time as part of semantic analysis.

Type checking can be implemented as a post-order tree walk, where each leaf node has a known type and each non-leaf node’s type can be inferred from the types of its children.

Pseudo-code for typechecking an expression:

def type_check(environment, node):
  if type(node) is AddExpressionNode:
    return type_check_add_expr(environment, node.e1, node.e2)
  ## .. case for each node

def typecheck_add_expr(environment, e1, e2):
  t1 = type_check(environment, e1)
  if not type(t1) == TInt:
    raise TypeCheckError('expected int')
  t2 = type_check(environment, e2)
  if not type(t2) == TInt:
    raise TypeCheckError('expected int')
  return TInt

Type rules

A type rule is an inference rule that describes how a type system assigns a type to a syntactic construct. Type rules can be applied by a type system to verify that a program is well-typed and to determine the type of each expression.

An expression $e$ of type $τ$ is written as $e : τ$ . The type environment is written as $Γ$ .

The notation for inference is the same as for inference rules. In general:

\frac{Γ_{1} ⊢ e_{1} : τ_{1} \dots Γ_{n} ⊢ e_{n} : τ_{n}}{Γ ⊢ e : τ}

The sequents above the line are premises that must be fulfilled in order for the rule to be applied, yielding the conclusion (the bottom sequents below the line). The turnstyle ( $⊢$ ) is read as “it is provable that …”.

An example rule:

\frac{⊢ e_{2} : I n t ⊢ e_{2} : I n t}{⊢ e_{1} + e_{2} : I n t}

Type environment

A type environment is a function that maps identifiers to types, giving types for free variables in an expression.

When type checking, the environment is usually passed down the AST from the root towards the leaves.

Let environment $Γ$ be a function mapping identifiers to types. $Γ ⊢ e : T$ , is read “under the assumption that free variables have the type given by $Γ$ , it is provable that the expression $e$ has type $T$ .

$Γ [T / x]$ is a function that applied to $x$ , returns $T$ .

In some languages method names and identifiers exist in different namespaces, therefore you can have both a method and a variable foo. This is implemented by using different environments (e.g. one for identifiers, and one for method names).

Subtyping

Subtyping is a form of type polymorphism where a subtype is related to another datatype (the supertype) by some notion of substitutability.

Figure: Subtyping hierarchy

If $Y$ is a subtype of $X$ , the subtyping relation is written $Y \leq X$ .

In OO, subclasses can only add methods or override methods with the same type signature.

Variance refers to how type constructs (like function return types) use subtyping relations. An example is covariance, which is commonly used for function return type. Covariance of a return type $X$ would allow any subtype $S$ (so that $S \leq X$ ) to be used in place of type $X$ .

Variance	Supertype allowed	Subtype allowed
Invariance	No	No
Covariance	No	Yes
Contravariance	Yes	No
Bivariance	Yes	Yes

Least upper bounds

In pure OO languages, the Least upper bound (LUB) of two types $S$ and $T$ is their lowest common ancestor in the hierarchy tree.

In language where conditional expressions evaluate to a value, the type of an expression would be $L U B (T_{1}, . . ., T_{N})$ , where $T_{1}, . . ., T_{N}$ are the types corresponding to each consequent expression.

References

[1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools (2nd Edition). USA: Addison-Wesley Longman Publishing Co., Inc., 2006.
[2] K. D. Cooper and T. Linda, Engineering a Compiler, 2nd ed. Morgan Kaufmann Publishers, 2012.