Coordination and time

Introduction
Clock synchronization
1. NTP
2. The Berkley algorithm
History
1. Serializability
Logical clocks
1. Lamport logical clocks
2. Vector clocks
Mutual exclusion
Election algorithms
1. The bully algorithm
References

Introduction

Coordination between processes can be required for several reasons:

To ensure two sets of data are the same (data synchronization).
To ensure one process waits for another to finish its operation (process synchronization).
To manage the interactions and dependencies between activities in a distributed system (coordination).

Clock synchronization

Clock synchronization is the process of coordinating independent clocks.

Clock synchronization in a distributed system is difficult since processes running on different machines will have different hardware clocks and hardware clocks suffer from clock drift. An average quartz-based hardware clock has a clock drift rate of around 31.5 seconds per year [1, Pp. 298, 303].

Clock synchronization is needed to overcome the problem of clock drift in order to provide a holistic view of a system, for example when reading logs from multiple processes.

Two important timekeeping concepts are precision and accuracy. Precision refers to the deviation between two clocks. Accuracy refers to the precision in reference to an external value, like UTC [1, P. 303].

Internal clock synchronization is the process of keeping clocks precise, whereas external clock synchronization is the process of making clocks accurate [1, P. 303].

UTC is the primary time standard used to regulate time. There are shortwave radio stations around the world that broadcast a pulse at the start of each UTC second, as well as satellites that offer a UTC service [1, P. 302].

A UTC receiver is hardware that can receive UTC from providers such as satellites. Other computers can sync their clocks to a computer equipped with a receiver in order to follow UTC [1, P. 302].

When a computer’s clock is greater than the actual time, the clock must be readjusted. Setting the clock to the lower correct time can lead to problems for programs that rely on an ever-increasing time value. To solve this, the time change can be introduced gradually. One way to do this is to reduce the number of milliseconds that the computer time is incremented by on each tick until the local time has caught up to the time the computer is synchronizing with.

There are many algorithms to synchronize several computers to one machine with a UTC receiver or for keeping a set of machines as synchronized as possible.

NTP

NTP (Network Time Protocol) is a protocol for synchronizing clocks over a network [1, P. 304].

In NTP, the clock offset and roundtrip delay between two machines is estimated by sending messages between them. The offset can then be used to adjust the local time [1, P. 305].

NTP divides servers into strata. A server with a reference clock is known as a stratum-1 server. When server $A$ contacts server $B$ , it will adjust its time iff $s t r a t u m_{l} e v e l (A) > s t r a t u m_{l} e v e l (B)$ . Once $A$ has synchronized, $A$ will change its stratum level to $s t r a t u m_{l} e v e l (B) + 1$ [1, P. 306].

The Berkley algorithm

The Berkley algorithm is an internal clock synchronization algorithm. It works by having a time daemon poll other machines periodically to get their respective times. The daemon then computes an average time using the answers and tells other machines to advance/slow their clocks to the new time [1, P. 306].

The Berkley algorithm works well for systems that don’t have a UTC receiver where internal clock synchronization is adequate [1, P. 306].

History

A history is a sequence of events that occur in a distributed system.

Informally, a history is said to be totally ordered if each event occurs one after the other. A total order can be created by ordering events according to the time each event occurred. In the case of multiple events occurring at the same time (concurrent events), the tie can be broken arbitrarily (e.g. by comparing the process ID of the event’s originator) [2].

Informally, a history is said to be partially ordered if some events are not ordered relative to each other.

Figure: Total and partial ordering

Note: For the formal definition of total ordering and partial ordering, see the total order Wikipedia page and the partial order Wikipedia page.

Serializability

A history is defined as serializable if the outcome of executing transactions (or operations) in the history is the same as if the transactions were executed sequentially one after the other [3, P. 631].

Logical clocks

Logical clocks are abstract clocks that have no relation to physical time. They can be used in distributed systems to create an ordering of events [2, P. 559].

Lamport logical clocks

Leslie Lamport introduced a logical clock system in the paper Time, Clocks, and the Ordering of Events in a Distributed System [2].

Lamport’s clock system can be used to create a totally ordered history in a system that doesn’t have a precise physical time available to all processes [2, P. 559].

In the paper, Lamport introduces the happened-before relation (denoted: $\to$ ). $a \to b$ means event $a$ occurred before event $b$ .

Using Lamport’s assumptions, there are only three ways to determine that event $a \to b$ :

If $a$ and $b$ are events running in the same process and $a$ occurs before $b$ .
If $a$ is the event of a message being sent by a process and $b$ is the event of the same message being received by a different process.
If $a \to b$ and $b \to c$ then $a \to c$ (happened-before is a transitive relation).

A Lamport clock system can be used to assign a number to each event. Each process $P_{i}$ has an event counter $C_{i}$ where $C_{i} (a)$ assigns a number to event $a$ in that process. Globally, $C (b) = C_{j} (b)$ if $b$ is an event that occurred in $P_{j}$ [2, P. 559].

The Clock Condition states that if $a \to b$ then $C (a) < C (b)$ . The Clock Condition is satisfied as long as two conditions hold:

C1: If $a$ and $b$ are events in process $P_{i}$ and $a$ occurs before $b$ then $C_{i} (a) < C_{i} (b)$ .
C2: If $a$ is the sending of a message by process $P_{i}$ and $b$ is receipt of the message by process $P_{j}$ , then $C_{i} (a) < C_{j} (b)$ .

[2, P. 560]

This can be implemented with the following rules:

IR1: Each process $P_{i}$ increments $C_{i}$ between any two successive events.
IR2:
- a) If event $a$ is the sending of message $m$ by process $P_{i}$ , $m$ contains a timestamp $T_{m} = C_{i} (a)$
- b) On receiving a message $m$ , process $P_{j}$ sets $C_{j}$ to $m a x (T_{m} + 1, C_{j})$

[2, P. 560]

The clock system creates a partially ordered history because two events can have the same event counter value. To create a totally ordered history, a tie between events with the same counter value can be broken arbitrarily by assigning a numeric value to a process’ ID and using the ID value to compare the conflicting events [2, P. 561].

Vector clocks

A vector clock is a data structure for determining the partial ordering of events in a distributed system.

Vector clocks extend on logical clocks to capture causality [4, P. 57].

With vector clocks, timestamps are represented as a vector where each index $i$ in the vector holds the event counter of a process $P_{i}$ , e.g.: $[c_{1}, c_{2}, . . ., c_{n}]$ .

Let $V C_{i}$ be the vector clock of process $P_{i}$ and $t s (m)$ be the timestamp of message $m$ :

$V C_{i} [i]$ is the number of events that have occurred so far at $P_{i}$ .
$V C_{i} [j]$ is $P_{i}$ ’s knowledge of the local time at $P_{j}$ .

[1, P. 318]

Vector clocks are sent with messages according to the following rules:

Before executing an event $P_{i}$ increments $V C_{i} [i]$ .
When $P_{i}$ sends a message $m$ to process $P_{j}$ it sets $m$ ’s vector timestamp to $V C_{i}$ after executing step 1.
When $P_{j}$ receives a message it adjusts $V C_{j} [k]$ to $m a x (V C_{j} [k], t s (m) [k])$ for each $k$ (equivalent to merging causal histories). It then executes the first step to record the receipt of the message.

[1, P. 318]

This data allows a process to determine whether an event must have occurred before another event by comparing the values of each process’ timestamp at the time an event occurred [4, P. 59].

You can also use vector clocks to determine if two events are concurrent, which, depending on the event, could cause conflicts [1, P. 319].

Mutual exclusion

Mutual exclusion is needed to ensure that resources cannot be accessed multiple times, since concurrent access could lead to corrupted data [1, P. 321].

There are two main categories of distributed mutual exclusion algorithms:

Token-based solutions
Permission-based solutions

In token-based solutions a token is passed between processes. Only processes that hold a token can access the shared resource. The downside to token-based solutions is that if the token is lost a new token must be generated (which can be a complicated procedure) [1, P. 322].

In permission-based solutions, a process that wishes to access a resource must gain permission from other processes first [1, P. 322].

Election algorithms

Election algorithms are used to pick a node to perform a special role (e.g. a coordinator) [1, P. 329].

Generally, election algorithms work by deciding which reachable node has the highest ID [1, P. 329].

The goal of an election algorithm is to ensure that an election process ends with all processes agreeing on an elected node [1, P. 330].

The bully algorithm

The bully algorithm uses the following steps to elect a new coordinator:

$P_{i}$ sends an ELECTION message to all processes with higher identifiers.
If no one responds, $P_{i}$ wins the election and becomes coordinator.
If one of the higher-valued nodes responds, it takes over from $P_{i}$ .

A node can receive an ELECTION message from a lower-valued node at any moment. If it does, the receiver sends OK to the sender to indicate that the node is alive and taking over, and the receiver then holds an election [1, P. 330].

Eventually all processes give up but one, which is the newly elected node. It announces its victory by sending a message to all processes [1, P. 330].

If a process that was down comes back up, it holds an election [1, P. 330].

References

[1] A. Tanenbaum and M. van Steen, Distributed Systems, 3.01 ed. Pearson Education, Inc., 2017.
[2] L. Lamport, “Time, Clocks, and the Ordering of Events in a Distributed System,” Commun. ACM, vol. 21, no. 7, pp. 558–565, Jul. 1978.
[3] C. H. Papadimitriou, “The serializability of concurrent database updates,” Journal of the ACM (JACM), vol. 26, no. 4, pp. 631–653, 1979.
[4] C. J. Fidge, “Timestamps in message-passing systems that preserve the partial ordering,” Proceedings of the 11th Australian Computer Science Conference, vol. 10, no. 1, pp. 56–66, 1988.