### Design of Parallel and High-Performance Computing

Fall 2016 Lecture: Locks and Lock-Free

Motivational video: https://www.youtube.com/watch?v=jhApQIPQquw

Instructors: Torsten Hoefler & Markus Püschel

TA: Salvatore Di Girolamo

ETH Zürich (Eidgenössische Technische Hochschule Zürich, Swiss Federal Institute of Technology Zurich)

### Administrivia

- Final project presentation: Monday 12/19 during last lecture
  - Report will be due in January!
    - Still, starting to write early is very helpful: write, rewrite, rewrite (no joke!)
  - Some more ideas what to talk about:
    - What tools/programming language/parallelization scheme do you use?
    - Which architecture? (we only offer access to Xeon Phi, you may use a different one)
    - How to verify correctness of the parallelization?
    - How to argue about performance (bounds, what to compare to?)
    - (Somewhat) realistic use-cases and input sets?
    - What are the key concepts employed?
    - What are the main obstacles?

### **Review of last lecture**

- Language memory models
  - Java/C++ memory model overview
  - Synchronized programming
- Locks
  - Broken two-thread locks
  - Peterson
  - N-thread locks (filter lock)
  - Many different locks, strengths and weaknesses
  - Lock options and parameters
- Formal proof methods
  - Correctness (mutual exclusion as condition)
  - Progress

### **DPHPC Overview**

[Figure: course overview. Topics shown: parallelism and locality; concepts & techniques (vector ISA, shared memory, distributed memory, caches, memory hierarchy, cache coherency, memory models, locks, lock free, wait free, linearizability, distributed algorithms, group communications); models (Amdahl's and Gustafson's law, PRAM, α-β, LogP, I/O complexity, balance principles I and II, Little's Law, scheduling)]

### **Goals of this lecture**

- More N-thread locks!
  - Hardware operations for concurrency control
- More on locks (using advanced operations)
  - Spin locks
  - Various optimized locks
- Even more on locks (issues and extended concepts)
  - Deadlocks, priority inversion, competitive spinning, semaphores
- Case studies
  - Barrier, reasoning about semantics
- Locks in practice: a set structure

### Lock Fairness

- Starvation freedom provides no guarantee on how long a thread waits or if it is "passed"!
- To reason about fairness, we define two sections of each lock algorithm:
  - Doorway D (bounded # of steps)
  - Waiting W (unbounded # of steps)

void lock() {
 int j = 1 - tid;
 flag[tid] = true; // I'm interested (doorway)
 victim = tid; // other goes first (doorway)
 while (flag[j] && victim == tid) {}; // waiting
}

#### FIFO locks:

- If $T_A$ finishes its doorway before $T_B$, then $CR_A \rightarrow CR_B$
- Implies fairness
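To make the doorway/waiting split concrete, here is a minimal C++ sketch of one classic FIFO lock, Lamport's bakery algorithm. It assumes a fixed number of threads with ids 0..n-1, sequentially consistent `std::atomic` accesses, and labels that never overflow; the class name `BakeryLock`, the helper `precedes`, and the demo function are illustrative choices, not part of the lecture.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Bakery lock for a fixed number of threads with ids 0..n-1 (sketch).
class BakeryLock {
    int n_;
    std::vector<std::atomic<int>> flag_;
    std::vector<std::atomic<int>> label_;

    // (label[a], a) <* (label[b], b): lexicographic order on (ticket, id).
    bool precedes(int a, int b) {
        int la = label_[a].load(), lb = label_[b].load();
        return la < lb || (la == lb && a < b);
    }

public:
    explicit BakeryLock(int n) : n_(n), flag_(n), label_(n) {
        for (int i = 0; i < n; ++i) { flag_[i].store(0); label_[i].store(0); }
    }

    void lock(int tid) {
        // Doorway: a bounded number of steps.
        flag_[tid].store(1);                               // request
        int max_label = 0;
        for (int k = 0; k < n_; ++k)
            max_label = std::max(max_label, label_[k].load());
        label_[tid].store(max_label + 1);                  // take ticket
        // Waiting: an unbounded number of steps.
        for (int k = 0; k < n_; ++k) {
            if (k == tid) continue;
            while (flag_[k].load() && precedes(k, tid))
                std::this_thread::yield();                 // be polite when oversubscribed
        }
    }

    void unlock(int tid) { flag_[tid].store(0); }
};

// Hypothetical demo: every thread increments a plain counter under the lock.
inline long bakery_demo(int threads, int iters) {
    BakeryLock lock(threads);
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            for (int i = 0; i < iters; ++i) {
                lock.lock(t);
                ++counter;
                lock.unlock(t);
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

A thread that completes the doorway first holds the smaller (label, id) pair, so later arrivals wait for it; this is the FIFO property stated above.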

### Lamport's Bakery Algorithm (1974)

- Is a FIFO lock (and thus fair)
- Each thread takes a number in doorway and threads enter in the order of their number!

volatile int flag[n] = {0,0,...,0};
volatile int label[n] = {0,0,....,0};

void lock() {
 flag[tid] = 1; // request
 label[tid] = max(label[0], ...,label[n-1]) + 1; // take ticket
 while ((∃k != tid)(flag[k] && (label[k],k) <\* (label[tid],tid))) {};
}

void unlock() {
 flag[tid] = 0;
}

### Lamport's Bakery Algorithm

- Advantages:
  - Elegant and correct solution
  - Starvation free, even FIFO fairness
- Not used in practice!
  - Why?
  - Needs to read/write N memory locations for synchronizing N threads
- Can we do better?
  - Using only atomic registers/memory

### A Lower Bound to Memory Complexity

- Theorem 5.1 in [1]: "If S is a [atomic] read/write system with at least two processes and S solves mutual exclusion with global progress [deadlock-freedom], then S must have at least as many variables as processes"
- So we're doomed! Optimal locks are available and they're fundamentally non-scalable. Or not?

[1] J. E. Burns and N. A. Lynch. Bounds on shared memory for mutual exclusion. Information and Computation, 107(2):171–184, December 1993

### Hardware Support?

- Hardware atomic operations:
  - Test&Set
    - Write const to memory while returning the old value
  - Atomic swap
    - Atomically exchange memory and register
  - Fetch&Op
    - Get value and apply operation to memory location
  - Compare&Swap
    - Compare two values and swap memory with register if equal
  - Load-linked/Store-Conditional LL/SC
    - Loads value from memory, allows operations, commits only if no other updates committed → mini-TM
  - Intel TSX (transactional synchronization extensions)
    - Hardware-TM (roll your own atomic operations)

### **Relative Power of Synchronization**

- **Design-Problem I: Multi-core Processor**
  - Which atomic operations are useful?
- **Design-Problem II: Complex Application**
  - What atomic should I use?
- Concept of "consensus number" C: a primitive has consensus number C if it can be used to solve the "consensus problem" among C threads in a finite number of steps (even if threads stop)
  - atomic registers have C=1 (thus locks have C=1!)
  - TAS, Swap, Fetch&Op have C=2
  - CAS, LL/SC, TM have C=∞
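As a rough illustration of these primitives, the sketch below expresses Test&Set, Fetch&Op, and Compare&Swap with C++ `std::atomic`. The wrapper names (`test_and_set`, `fetch_and_add`, `fetch_and_multiply`) are hypothetical; the last one shows how a Fetch&Op that the hardware does not offer can be built from a CAS retry loop, which hints at why CAS is the stronger primitive.

```cpp
#include <atomic>

// Test&Set (a special case of atomic swap): write true, return the old value.
inline bool test_and_set(std::atomic<bool>& flag) {
    return flag.exchange(true);
}

// Fetch&Op with Op = add is available directly.
inline int fetch_and_add(std::atomic<int>& x, int v) {
    return x.fetch_add(v);
}

// Any other Fetch&Op can be emulated with Compare&Swap and a retry loop.
inline int fetch_and_multiply(std::atomic<int>& x, int factor) {
    int old = x.load();
    // On failure, compare_exchange_weak reloads `old` with the current value.
    while (!x.compare_exchange_weak(old, old * factor)) {
    }
    return old;
}
```

The CAS loop is lock-free but not wait-free: an individual thread may retry repeatedly while others succeed.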

### **Test-and-Set Locks**

- **Test-and-Set semantics**
  - Remember old value
  - Set fixed value TASval (true)
  - Return old value
- After execution: Post-condition is a fixed (constant) value!

bool test\_and\_set (bool \*flag) {
 bool old = \*flag;
 \*flag = true;
 return old;
} // all atomic!
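A lock built directly on these semantics can be sketched as follows, assuming `std::atomic<bool>::exchange` as the Test&Set primitive. The class name `TasLock` and the demo function are illustrative; the `yield()` in the spin loop is only there so the sketch behaves on machines with fewer cores than threads.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class TasLock {
    std::atomic<bool> held_{false};

public:
    void lock() {
        // Every iteration is an atomic read-modify-write on the same location.
        while (held_.exchange(true))
            std::this_thread::yield();
    }
    void unlock() { held_.store(false); }
};

// Hypothetical demo: every thread increments a plain counter under the lock.
inline long tas_demo(int threads, int iters) {
    TasLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```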





Similar to: T. Anderson: "The performance of spin lock alternatives for shared-memory multiprocessors", TPDS, Vol. 1 Issue 1, Jan 1990

### Improvements?

#### Are TAS locks perfect?

- What are the two biggest issues?
- Cache coherency traffic (contending on same location with expensive atomics)

-- or --

- Critical section underutilization (waiting for backoff times will delay entry to CR)
- What would be a fix for that?
  - How is this solved at airports and shops (often at least)?
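Both issues can be seen in a test-and-test-and-set lock with exponential backoff, sketched below. The initial delay of 1 µs and the cap of 64 µs are arbitrary, hypothetical values, and `BackoffLock` is an illustrative name. Spinning on a plain load reduces coherency traffic, while the sleep after a lost race is exactly the time during which the critical section may sit idle.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

class BackoffLock {
    std::atomic<bool> held_{false};

public:
    void lock() {
        int delay_us = 1;               // hypothetical initial backoff
        const int max_delay_us = 64;    // hypothetical cap
        for (;;) {
            // Test: spin on a plain load while the lock is held.
            while (held_.load())
                std::this_thread::yield();
            // Test-and-set: one expensive atomic once the lock looks free.
            if (!held_.exchange(true)) return;
            // Lost the race: back off before contending again.
            std::this_thread::sleep_for(std::chrono::microseconds(delay_us));
            delay_us = std::min(2 * delay_us, max_delay_us);
        }
    }
    void unlock() { held_.store(false); }
};

// Hypothetical demo: every thread increments a plain counter under the lock.
inline long backoff_demo(int threads, int iters) {
    BackoffLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```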

#### Queue locks

- Threads enqueue
  - Learn from predecessor if it's their turn
  - Each thread spins at a different location
  - FIFO fairness

### **Comparison of TAS Locks**



### **Array Queue Lock**

- Array to implement queue
  - Tail-pointer shows next free queue position
  - Each thread spins on own location
    - CL padding!
  - index[] array can be put in TLS

#### So are we done now?

- What's wrong?
  - Synchronizing M objects requires $\Theta(NM)$ storage
- What do we do now?

volatile int array[n] = {1,0,...,0};
volatile int index[n] = {0,0,...,0};
volatile int tail = 0;

void lock() {
 index[tid] = GetAndInc(tail) % n;
 while (!array[index[tid]]); // wait to receive lock
}

void unlock() {
 array[index[tid]] = 0; // I release my lock
 array[(index[tid] + 1) % n] = 1; // next one
}
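The array queue lock could look as follows in C++. This is a sketch under several assumptions: a 64-byte cache line, at most n concurrent threads, sequentially consistent atomics, and a slot index returned from `lock()` instead of a thread-local `index[]` array. The names `ArrayQueueLock` and `Slot` are illustrative.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class ArrayQueueLock {
    // One flag per slot, padded to 64 bytes so that no two flags share a
    // cache line (64 B is an assumption about the target machine).
    struct Slot {
        std::atomic<bool> has_lock{false};
        char pad[63];
    };
    unsigned n_;
    std::vector<Slot> slots_;
    std::atomic<unsigned> tail_{0};

public:
    // n must be at least the number of threads that may contend.
    explicit ArrayQueueLock(unsigned n) : n_(n), slots_(n) {
        slots_[0].has_lock.store(true);
    }

    unsigned lock() {
        unsigned my = tail_.fetch_add(1) % n_;      // GetAndInc(tail) % n
        while (!slots_[my].has_lock.load())         // spin on own location
            std::this_thread::yield();
        return my;
    }

    void unlock(unsigned my) {
        slots_[my].has_lock.store(false);           // I release my lock
        slots_[(my + 1) % n_].has_lock.store(true); // next one
    }
};

// Hypothetical demo: every thread increments a plain counter under the lock.
inline long array_lock_demo(unsigned threads, int iters) {
    ArrayQueueLock lock(threads);
    long counter = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                unsigned slot = lock.lock();
                ++counter;
                lock.unlock(slot);
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Each lock object carries its own n-slot array, which is where the $\Theta(NM)$ storage for M locks comes from.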

### CLH Lock (1993)

- List-based (same queue principle)
- Discovered twice, by Craig and by Landin and Hagersten, 1993/94
- 2N+3M words
  - N threads, M locks
- Requires thread-local qnode pointer
  - Can be hidden!

typedef struct qnode {
 struct qnode \*prev;
 int succ\_blocked;
} qnode;

qnode \*lck = new qnode; // node owned by lock, its succ\_blocked must start at 0

void lock(qnode \*\*lck, qnode \*qn) {
 qn->succ\_blocked = 1;
 qn->prev = FetchAndSet(lck, qn); // atomically swap in own node as new tail
 while (qn->prev->succ\_blocked); // spin on predecessor's node
}

void unlock(qnode \*\*qn) {
 qnode \*pred = (\*qn)->prev;
 (\*qn)->succ\_blocked = 0; // release
 \*qn = pred; // reuse predecessor's node next time
}
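The pseudocode above could be made concrete roughly as follows. This sketch assumes sequentially consistent atomics and heap-allocated nodes; `CLHLock`, `CLHNode`, and the demo are illustrative names. Each thread brings one node and leaves with its predecessor's node, so the thread-local pointer is passed by reference to `unlock`.

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct CLHNode {
    std::atomic<bool> succ_blocked{false};
    CLHNode* prev = nullptr;
};

class CLHLock {
    std::atomic<CLHNode*> tail_;

public:
    CLHLock() : tail_(new CLHNode) {}       // node owned by the lock, not blocked
    ~CLHLock() { delete tail_.load(); }

    void lock(CLHNode* qn) {
        qn->succ_blocked.store(true);
        qn->prev = tail_.exchange(qn);      // FetchAndSet: enqueue own node
        while (qn->prev->succ_blocked.load())
            std::this_thread::yield();      // spin on predecessor's node
    }

    void unlock(CLHNode*& qn) {
        CLHNode* pred = qn->prev;
        qn->succ_blocked.store(false);      // release: successor may proceed
        qn = pred;                          // recycle predecessor's node
    }
};

// Hypothetical demo: every thread increments a plain counter under the lock.
inline long clh_demo(int threads, int iters) {
    CLHLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            CLHNode* qn = new CLHNode;      // thread-local qnode pointer
            for (int i = 0; i < iters; ++i) {
                lock.lock(qn);
                ++counter;
                lock.unlock(qn);
            }
            delete qn;                      // the node this thread ended up owning
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

With N threads and one lock there are N+1 nodes in total: one per thread plus the one currently referenced by the lock's tail.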

### CLH Lock (1993)

- Qnode objects represent thread state!
  - succ\_blocked == 1 if waiting or acquired lock
  - succ\_blocked == 0 if released lock
- List is implicit!
  - One node per thread
  - Spin location changes: NUMA issues (cacheless)
- Can we do better?


### MCS Lock (1991)

- Make queue explicit
  - Acquire lock by appending to queue
  - Spin on own node until locked is reset
- Similar advantages as CLH but
  - Only 2N + M words
  - Spinning position is fixed! Benefits cache-less NUMA
- What are the issues?
  - Releasing lock spins
  - More atomics!

typedef struct qnode {
 struct qnode \*next;
 int locked;
} qnode;

qnode \*lck = NULL;

void lock(qnode \*\*lck, qnode \*qn) {
 qn->next = NULL;
 qnode \*pred = FetchAndSet(lck, qn);
 if(pred != NULL) {
  qn->locked = 1;
  pred->next = qn;
  while(qn->locked);
 }
}

void unlock(qnode \*\*lck, qnode \*qn) {
 if(qn->next == NULL) { // if we're the last waiter
  if(CAS(lck, qn, NULL)) return;
  while(qn->next == NULL); // wait for successor arrival
 }
 qn->next->locked = 0; // free next waiter
 qn->next = NULL;
}
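A C++ rendering of the MCS lock might look like this, again assuming sequentially consistent atomics. `MCSLock`, `MCSNode`, and the demo are illustrative names; each thread supplies its own node, which can live on its stack because it is only referenced while the thread is queued or holds the lock.

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class MCSLock {
    std::atomic<MCSNode*> tail_{nullptr};

public:
    void lock(MCSNode* qn) {
        qn->next.store(nullptr);
        MCSNode* pred = tail_.exchange(qn);          // append to queue
        if (pred != nullptr) {
            qn->locked.store(true);
            pred->next.store(qn);                    // make the queue explicit
            while (qn->locked.load())
                std::this_thread::yield();           // spin on own node
        }
    }

    void unlock(MCSNode* qn) {
        MCSNode* succ = qn->next.load();
        if (succ == nullptr) {                       // maybe we are the last waiter
            MCSNode* expected = qn;
            if (tail_.compare_exchange_strong(expected, nullptr)) return;
            // A successor swapped the tail but has not linked itself yet.
            while ((succ = qn->next.load()) == nullptr)
                std::this_thread::yield();
        }
        succ->locked.store(false);                   // free next waiter
    }
};

// Hypothetical demo: every thread increments a plain counter under the lock.
inline long mcs_demo(int threads, int iters) {
    MCSLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            MCSNode qn;                              // fixed spin location per thread
            for (int i = 0; i < iters; ++i) {
                lock.lock(&qn);
                ++counter;
                lock.unlock(&qn);
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```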

### Lessons Learned!

#### Key Lesson:

- Reducing memory (coherency) traffic is most important!
- Not always straightforward (need to reason about CL states)

#### MCS: 2006 Dijkstra Prize in distributed computing

- "an outstanding paper on the principles of distributed computing, whose significance and impact on the theory and/or practice of distributed computing has been evident for at least a decade"
- "probably the most influential practical mutual exclusion algorithm ever"
- "vastly superior to all previous mutual exclusion algorithms"
- fast, fair, scalable → widely used, always compared against!

### **Time to Declare Victory?**

- Down to memory complexity of 2N+M
  - Probably close to optimal
- Only local spinning
  - Several variants with low expected contention
- But: we assumed sequential consistency 😕
  - Reality causes trouble sometimes
  - Sprinkling memory fences may harm performance
- Open research on minimally-synching algorithms!
  - Come and talk to me if you're interested

### **More Practical Optimizations**

- Let's step back to "data race"
  - (recap) two operations A and B on the same memory cause a data race if one of them is a write ("conflicting access") and neither A→B nor B→A
  - So we put conflicting accesses into a CR and lock it! This also guarantees memory consistency in C++/Java!
- Let's say you implement a web-based encyclopedia
  - Consider the "average two accesses": do they conflict?

### **Reader-Writer Locks**

#### Allows multiple concurrent reads

- Multiple reader locks concurrently in CR
- Guarantees mutual exclusion between writer and writer locks and reader and writer locks

#### Syntax:

- read\_(un)lock()
- write\_(un)lock()

### **A Simple RW Lock**

#### Seems efficient!?

- Is it? What's wrong?
- Polling CAS!

#### Is it fair?

- Readers are preferred!
- Can always delay writers (again and again and again)

const int W = 1;
const int R = 2;
volatile int lock = 0; // LSB is writer flag!

void read\_lock(lock\_t \*lock) {
 AtomicAdd(lock, R); // announce reader
 while(\*lock & W); // wait until no writer holds the lock
}

void write\_lock(lock\_t \*lock) {
 while(!CAS(lock, 0, W)); // succeeds only with no readers and no writer
}

void read\_unlock(lock\_t \*lock) {
 AtomicAdd(lock, -R);
}

void write\_unlock(lock\_t \*lock) {
 AtomicAdd(lock, -W);
}
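The simple reader-preferring RW lock can be sketched in C++ as below, assuming sequentially consistent atomics. `SimpleRWLock` and `rw_demo` are illustrative names. The demo lets writers update two variables together and lets readers check that they never observe a half-finished update.

```cpp
#include <atomic>
#include <thread>

class SimpleRWLock {
    static constexpr int W = 1;   // LSB is the writer flag
    static constexpr int R = 2;   // each reader adds 2
    std::atomic<int> state_{0};

public:
    void read_lock() {
        state_.fetch_add(R);                       // announce reader first
        while (state_.load() & W)                  // then wait for the writer
            std::this_thread::yield();
    }
    void read_unlock() { state_.fetch_sub(R); }

    void write_lock() {
        int expected = 0;                          // no readers, no writer
        while (!state_.compare_exchange_weak(expected, W)) {
            expected = 0;                          // CAS overwrote it; reset
            std::this_thread::yield();             // polling CAS
        }
    }
    void write_unlock() { state_.fetch_sub(W); }
};

// Hypothetical demo: two writers keep a == b, two readers verify it.
inline bool rw_demo(int iters) {
    SimpleRWLock lock;
    long a = 0, b = 0;
    std::atomic<bool> consistent{true};
    auto writer = [&] {
        for (int i = 0; i < iters; ++i) {
            lock.write_lock();
            ++a;
            ++b;
            lock.write_unlock();
        }
    };
    auto reader = [&] {
        for (int i = 0; i < iters; ++i) {
            lock.read_lock();
            if (a != b) consistent.store(false);
            lock.read_unlock();
        }
    };
    std::thread w1(writer), w2(writer), r1(reader), r2(reader);
    w1.join(); w2.join(); r1.join(); r2.join();
    return consistent.load() && a == 2L * iters && b == 2L * iters;
}
```

Because readers increment the counter before checking the writer flag, a steady stream of readers keeps the writer's CAS from ever seeing 0, which is the unfairness noted above.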



[http://research.microsoft.com/en-us/um/people/mbj/Mars\_Pathfinder/Authoritative\_Account.html]

### **Priority Inversion**

- If busy-waiting thread has higher priority than thread holding lock ⇒ no progress!
- Can be fixed with the help of the OS
  - E.g., mutex priority inheritance (temporarily boost priority of task in CR to highest priority among waiting tasks)

### **Condition Variables**

- Allow threads to yield CPU and leave the OS run queue
  - Other threads can get them back on the queue!
- cond\_wait(cond, lock): yield and go to sleep
- cond\_signal(cond): wake up sleeping threads
- Wait and signal are OS calls
  - Often expensive, which one is more expensive?
    - Wait, because it has to perform a full context switch

### **Condition Variable Semantics**

#### Hoare-style:

- Signaler passes lock to waiter, signaler suspended
- Waiter runs immediately
- Waiter passes lock back to signaler if it leaves critical section or if it waits again

#### Mesa-style (most used):

- Signaler keeps lock
- Waiter simply put on run queue
- Needs to acquire lock, may wait again
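Under Mesa-style semantics the waiter must re-check its condition after waking up, which is why waits are written as loops. A minimal sketch with `std::condition_variable`, which behaves this way, is shown below; the `Mailbox` class and the demo are hypothetical.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

class Mailbox {
    std::mutex m_;
    std::condition_variable nonempty_;
    std::queue<int> items_;

public:
    void put(int v) {
        {
            std::lock_guard<std::mutex> guard(m_);
            items_.push(v);
        }
        nonempty_.notify_one();   // waiter is only made runnable, signaler continues
    }

    int take() {
        std::unique_lock<std::mutex> guard(m_);
        while (items_.empty())    // re-check: the condition may be false again
            nonempty_.wait(guard);  // releases m_ while asleep, reacquires on wakeup
        int v = items_.front();
        items_.pop();
        return v;
    }
};

// Hypothetical demo: one producer sends 1..n, one consumer sums them up.
inline long mailbox_demo(int n) {
    Mailbox box;
    long sum = 0;
    std::thread consumer([&] {
        for (int i = 0; i < n; ++i) sum += box.take();
    });
    std::thread producer([&] {
        for (int i = 1; i <= n; ++i) box.put(i);
    });
    producer.join();
    consumer.join();
    return sum;
}
```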

### When to Spin and When to Block?

- Spinning consumes CPU cycles but is cheap
  - "Steals" CPU from other threads
- Blocking has high one-time cost and is then free
  - Often hundreds of cycles (trap, save TCB ...)
  - Wakeup is also expensive (latency)
  - Also cache-pollution

#### Strategy:

- Poll for a while and then block

### When to Spin and When to Block?

#### What is a "while"?

- Optimal time depends on the future
  - When will the active thread leave the CR?
  - Can compute optimal offline schedule
  - Actual problem is an online problem

#### Competitive algorithms

- An algorithm is c-competitive if, for every sequence of actions x and a constant a, the following holds:
  - $C(x) \leq c \cdot C_{opt}(x) + a$
- What would a good spinning algorithm look like and what is the competitiveness?

### **Competitive Spinning**

- If T is the overhead to process a wait, then a locking algorithm that spins for time T before it blocks is 2-competitive!
  - Karlin, Manasse, McGeoch, Owicki: "Competitive Randomized Algorithms for Non-Uniform Problems", SODA 1989

#### If randomized algorithms are used, then e/(e-1)-competitiveness (~1.58) can be achieved

See paper above!
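The intuition behind the factor 2: if the lock is released after time t < T, spinning costs t, which is what the optimal offline strategy pays too; if t ≥ T, the algorithm pays T for spinning plus T for blocking, while the optimum blocks immediately and pays T. A spin-then-block lock along these lines can be sketched as follows; the class name and the spin budget passed in by the caller are assumptions for illustration, since the real overhead T is system dependent.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class SpinThenBlockLock {
    std::atomic<bool> held_{false};
    std::mutex m_;
    std::condition_variable cv_;
    std::chrono::microseconds spin_budget_;   // stands in for the overhead T

public:
    explicit SpinThenBlockLock(std::chrono::microseconds t) : spin_budget_(t) {}

    void lock() {
        // Phase 1: poll for at most the spin budget.
        auto deadline = std::chrono::steady_clock::now() + spin_budget_;
        do {
            if (!held_.exchange(true)) return;
        } while (std::chrono::steady_clock::now() < deadline);
        // Phase 2: block until an unlock lets us take the lock.
        std::unique_lock<std::mutex> guard(m_);
        cv_.wait(guard, [this] { return !held_.exchange(true); });
    }

    void unlock() {
        {
            // Clearing the flag under m_ prevents a lost wakeup.
            std::lock_guard<std::mutex> guard(m_);
            held_.store(false);
        }
        cv_.notify_one();
    }
};

// Hypothetical demo: every thread increments a plain counter under the lock.
inline long spin_block_demo(int threads, int iters) {
    SpinThenBlockLock lock(std::chrono::microseconds(50));  // assumed T
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```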


### **Generalized Locks: Semaphores**

- Controlling access to more than one resource
  - Described by Dijkstra 1965
- Internal state is an atomic counter C
- Two operations:
  - P(): block until C>0; decrement C (atomically)
  - V(): signal and increment C
- Binary or 0/1 semaphore equivalent to lock
  - C is always 0 or 1, i.e., V() will not increase it further
- Trivia:
  - If you're lucky (ahem, speak Dutch), mnemonics: Verhogen (increment) and Prolaag (probeer te verlagen = try to reduce)

### **Semaphore Implementation**

- Can be implemented with mutual exclusion!
  - And can be used to implement mutual exclusion
- ... or with test and set and many others!

#### Also has fairness concepts:

- Order of granting access to waiting (queued) threads
- strictly fair (starvation impossible, e.g., FIFO)
- weakly fair (starvation possible, e.g., random)
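One way to implement a counting semaphore with mutual exclusion is a mutex plus a condition variable, as in the sketch below. The class and demo names are illustrative, and the wake-up order is whatever the condition variable provides, so this version makes no FIFO promise. The demo records how many threads are inside the guarded region at once.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class Semaphore {
    std::mutex m_;
    std::condition_variable cv_;
    int c_;

public:
    explicit Semaphore(int initial) : c_(initial) {}

    void P() {   // block until C > 0, then decrement
        std::unique_lock<std::mutex> guard(m_);
        cv_.wait(guard, [this] { return c_ > 0; });
        --c_;
    }

    void V() {   // increment and signal
        {
            std::lock_guard<std::mutex> guard(m_);
            ++c_;
        }
        cv_.notify_one();
    }
};

// Hypothetical demo: returns the largest number of threads seen inside at once.
inline int semaphore_demo(int permits, int threads, int iters) {
    Semaphore sem(permits);
    std::atomic<int> inside{0};
    std::atomic<int> max_inside{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                sem.P();
                int now = ++inside;
                int prev = max_inside.load();
                while (prev < now && !max_inside.compare_exchange_weak(prev, now)) {
                }
                --inside;
                sem.V();
            }
        });
    for (auto& th : pool) th.join();
    return max_inside.load();
}
```

With one permit the semaphore degenerates to a lock, matching the binary case described earlier.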

### **Case Study 1: Barrier**

#### Barrier semantics:

- No process proceeds before all processes reached barrier
- Similar to mutual exclusion but not exclusive, rather "synchronized"
- Often needed in parallel high-performance programming
  - Especially in SPMD programming style
- Parallel programming "frameworks" offer barrier semantics (pthread, OpenMP, MPI)
  - MPI\_Barrier() (process-based)
  - pthread\_barrier
  - #pragma omp barrier

#### Simple implementation: lock xadd + spin

- Problem: when to re-use the counter?
  - Cannot just set it to 0
  - → Trick: "lock xadd -1" when done

[cf. http://www.spiral.net/software/barrier.html]
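The counter re-use problem can also be sidestepped with a sense-reversing barrier, which is a different technique from the decrement trick above and is shown here only as one well-known alternative. The sketch assumes sequentially consistent atomics; `SenseBarrier` and the demo are illustrative names. On x86, an atomic `fetch_sub` whose result is used typically compiles to a `lock xadd`.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class SenseBarrier {
    const int n_;
    std::atomic<int> count_;
    std::atomic<bool> sense_{false};

public:
    explicit SenseBarrier(int n) : n_(n), count_(n) {}

    // Each thread keeps its own local_sense, initially false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;
        if (count_.fetch_sub(1) == 1) {     // last thread to arrive
            count_.store(n_);               // safe: everyone else is spinning
            sense_.store(local_sense);      // release all waiters
        } else {
            while (sense_.load() != local_sense)
                std::this_thread::yield();
        }
    }
};

// Hypothetical demo: after each barrier, all threads must be in the same round.
inline bool barrier_demo(int threads, int rounds) {
    SenseBarrier barrier(threads);
    std::vector<int> progress(threads, 0);
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            bool local_sense = false;
            for (int r = 1; r <= rounds; ++r) {
                progress[t] = r;
                barrier.wait(local_sense);  // everyone has written round r
                for (int k = 0; k < threads; ++k)
                    if (progress[k] != r) ok.store(false);
                barrier.wait(local_sense);  // everyone has finished reading
            }
        });
    for (auto& th : pool) th.join();
    return ok.load();
}
```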

### **Case Study 2: Reasoning about Semantics**

#### Comments on a Problem in Concurrent Programming Control

Dear Editor:

I would like to comment on Mr. Dijkstra's solution [Solution of a problem in concurrent programming control. Comm ACM 8 (Sept. 1965), 569] to a messy problem that is hardly academic. We are using it now on a multiple computer complex.

When there are only two computers, the algorithm may be simplified to the following:

 Boolean array b(0; 1) integer k, i, j,

 comment This is the program for computer i, which may be either 0 or 1, computer $j \neq i$ is the other one, 1 or 0;

 C0: b(i) := false;

 C1: if $k \neq i$ then begin

 C2: if not b(j) then go to C2;

 else k := i; go to C1 end;

 b(i) := true;

 remainder of program;

 go to C0;

 end

Mr. Dijkstra has come up with a clever solution to a really practical problem.

HARRIS HYMAN

Munitype

New York, New York

Communications of the ACM, Volume 9 Issue 1, Jan. 1966

### **Barrier Performance**




### **Case Study 2: Reasoning about Semantics**

#### Is the proposed algorithm correct?

- We may prove it manually
  - Using tools from the last lecture → reason about the state space of H
- Or use automated proofs (model checking)
  - E.g., SPIN (Promela syntax)

bool want[2];
bool turn;
byte cnt;

proctype P(bool i)
{
 want[i] = 1;
 do
 :: (turn != i) -> (!want[1-i]); turn = i
 :: (turn == i) -> break
 od;
 skip; /\* critical section \*/
 cnt = cnt+1;
 assert(cnt == 1);
 cnt = cnt-1;
 want[i] = 0
}

init { run P(0); run P(1) }
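To give a feel for what a model checker does with such a model, the sketch below explores all interleavings of the two-process model by breadth-first search over its state space. This is a hand-written C++ stand-in, not SPIN; the encoding of program counters (0: set want, 1: test turn, 2: wait for !want[1-i], 3: set turn, 4: in critical section, 5: done) is an assumption made for this illustration.

```cpp
#include <array>
#include <queue>
#include <set>

// State: pc of P(0), pc of P(1), want[0], want[1], turn.
using State = std::array<int, 5>;

// Executes one atomic step of process i; returns false if it cannot move.
inline bool step(State& s, int i) {
    int& pc = s[i];
    int& want_i = s[2 + i];
    const int want_j = s[2 + (1 - i)];
    int& turn = s[4];
    switch (pc) {
    case 0: want_i = 1; pc = 1; return true;           // want[i] = 1
    case 1: pc = (turn != i) ? 2 : 4; return true;     // do :: guard
    case 2: if (want_j) return false;                  // (!want[1-i]) blocks
            pc = 3; return true;
    case 3: turn = i; pc = 1; return true;             // turn = i
    case 4: want_i = 0; pc = 5; return true;           // leave critical section
    default: return false;                             // process finished
    }
}

// Returns true if a state with both processes in the critical section is reachable.
inline bool hyman_violates_mutex() {
    std::set<State> seen;
    std::queue<State> todo;
    State init = {{0, 0, 0, 0, 0}};
    seen.insert(init);
    todo.push(init);
    while (!todo.empty()) {
        State s = todo.front();
        todo.pop();
        if (s[0] == 4 && s[1] == 4) return true;        // cnt would be 2
        for (int i = 0; i < 2; ++i) {
            State t = s;
            if (step(t, i) && seen.insert(t).second) todo.push(t);
        }
    }
    return false;
}
```

One violating interleaving: P(1) sets want[1], sees turn != 1, passes the !want[0] test, and pauses; P(0) sets want[0], sees turn == 0, and enters; P(1) then sets turn = 1 and enters as well.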





### **Tricks Overview**

1. Fine-grained locking
2. Reader/writer locking
   - Multiple readers hold lock (traversal)
   - contains() only needs read lock
   - Locks may be upgraded during operation
   - Must ensure starvation-freedom for writer locks!
3. Optimistic synchronization
   - Traverse without locking
     - Need to make sure that this is correct!
   - Acquire lock if update necessary
     - May need re-start from beginning, tricky
4. Lazy locking
   - Postpone hard work to idle periods
   - Mark node deleted
     - Delete it physically later
5. Lock-free
   - Completely avoid locks
     - Enables wait-freedom
   - Will need atomics (see later why!)
   - Often very complex, sometimes higher overhead

They become successively more complex

### **Trick 1: Fine-grained Locking**

- Each element can be locked
  - High memory overhead
  - Threads can traverse list concurrently like a pipeline
- Tricky to prove correctness
  - And deadlock-freedom
  - Two-phase locking (acquire, release) often helps
- Hand-over-hand (coupled locking)
  - Not safe to release x's lock before acquiring x.next's lock
    - will see why in a minute
  - Important to acquire locks in the same order

typedef struct node {
 int key;
 struct node \*next;
 lock\_t lock;
} node;
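Hand-over-hand locking on a sorted list-based set can be sketched as follows. The sketch assumes `std::mutex` as the per-node lock, sentinel nodes holding `INT_MIN` and `INT_MAX` (so keys must lie strictly between them), and it omits remove() to stay short; `FineGrainedSet`, `locate`, and the demo are illustrative names.

```cpp
#include <atomic>
#include <climits>
#include <mutex>
#include <thread>
#include <vector>

struct Node {
    int key;
    Node* next;
    std::mutex lock;
    Node(int k, Node* n) : key(k), next(n) {}
};

class FineGrainedSet {
    Node* head_;   // sentinel INT_MIN -> ... -> sentinel INT_MAX

    // Returns with pred and curr both locked and pred->key < key <= curr->key.
    void locate(int key, Node*& pred, Node*& curr) {
        pred = head_;
        pred->lock.lock();
        curr = pred->next;
        curr->lock.lock();
        while (curr->key < key) {
            pred->lock.unlock();   // drop the older lock only while still
            pred = curr;           // holding the lock on the current node
            curr = curr->next;
            curr->lock.lock();     // always acquired in list order
        }
    }

public:
    FineGrainedSet() : head_(new Node(INT_MIN, new Node(INT_MAX, nullptr))) {}
    ~FineGrainedSet() {
        while (head_) { Node* n = head_->next; delete head_; head_ = n; }
    }

    bool add(int key) {
        Node *pred, *curr;
        locate(key, pred, curr);
        bool inserted = false;
        if (curr->key != key) {
            pred->next = new Node(key, curr);
            inserted = true;
        }
        curr->lock.unlock();
        pred->lock.unlock();
        return inserted;
    }

    bool contains(int key) {
        Node *pred, *curr;
        locate(key, pred, curr);
        bool found = (curr->key == key);
        curr->lock.unlock();
        pred->lock.unlock();
        return found;
    }
};

// Hypothetical demo: all threads try to add the same keys; returns the number
// of successful insertions, which should equal the number of distinct keys.
inline int set_demo(int threads, int keys) {
    FineGrainedSet set;
    std::atomic<int> inserted{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int k = 0; k < keys; ++k)
                if (set.add(k)) ++inserted;
        });
    for (auto& th : pool) th.join();
    return inserted.load();
}
```

Because every thread acquires node locks in list order, no cycle of waiting threads can form.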































