# Modeling Communication in Cache-Coherent SMP Systems A Case-Study with Xeon Phi

#### Sabela Ramos<sup>1</sup> (sramos@udc.es) Torsten Hoefler<sup>2</sup> (htor@inf.ethz.ch)

<sup>1</sup>Computer Architecture Group, University of A Coruña (Spain) <sup>2</sup>Scalable Parallel Computing Lab, ETH Zurich (Switzerland)

22nd ACM Intl. Symp. on HPDC, New York City, 2013





#### Motivation



3 Design of Communication Algorithms







- Increase in the number of cores per processor.
- x86 processors offer Cache Coherency (Xeon Phi).
  - Cache-coherency as the only means of communication between cores
  - CC protocols implemented using a state machine.
- PROPOSAL:
  - Model of the transition cost in the state machine.
  - Simplification of the full model.
  - Application of the model to algorithm optimization.

ntel Xeon Phi Architecture Communication Models





ntel Xeon Phi Architecture Communication Models





ntel Xeon Phi Architecture Communication Models



ntel Xeon Phi Architecture Communication Models



ntel Xeon Phi Architecture Communication Models



- Where is the line located?
- Which is the state of the line? Is it modified?
- Do I have to fetch it from memory?

ntel Xeon Phi Architecture Communication Models

## Cache-coherency protocols: MESI

There are others like MOESI, MESIF or extended MESI<sup>1</sup>.



<sup>1</sup>M state can be extended to allow sharing of modified lines.

ntel Xeon Phi Architecture Communication Models

### Transition diagram for extended MESI

(E,I)



| $\setminus \tau$ | CORE 1 |  |
|------------------|--------|--|
| $\sum_{i=1}^{n}$ |        |  |
| /                |        |  |

ntel Xeon Phi Architecture Communication Models





| $\setminus \tau$ | CORE 1 |  |
|------------------|--------|--|
| $\sum_{i=1}^{n}$ |        |  |
| (                |        |  |

ntel Xeon Phi Architecture Communication Models



ntel Xeon Phi Architecture Communication Models

### Transition diagram for extended MESI



ntel Xeon Phi Architecture Communication Models



ntel Xeon Phi Architecture Communication Models



ntel Xeon Phi Architecture Communication Models



ntel Xeon Phi Architecture Communication Models

### Transition diagram for extended MESI



S. Ramos, T. Hoefler

Modeling Communication in Cache-Coherent SMP Systems

ntel Xeon Phi Architecture Communication Models

### Transition diagram for extended MESI



S. Ramos, T. Hoefler

Modeling Communication in Cache-Coherent SMP Systems

Intel Xeon Phi Architecture Communication Models

## Intel Xeon Phi Architecture





- Intel Xeon Phi 5110P, 60 cores at 1056 MHz (4 threads per core).
- Vector Processing Unit with 64 byte registers.
- L1: 32 kb Data + 32 kb Instructions
- L2: 512 kb unified

Intel Xeon Phi Architecture Communication Models

# Intel Xeon Phi-Cache Coherency Protocol





- Distributed Tag Directories.
  - GOLS coherency state.
  - Lines assigned by hash function based on the address.
  - Even load distribution but no locality of the network.
  - Benchmarking needs randomization in the accesses to avoid bias

Intel Xeon Phi Architecture Communication Models

### Parametrization

- BenchIT<sup>2</sup> to measure cache line transfer latencies.
  - No significant difference among cached states (S, M, E).
  - Distance between cores is nearly irrelevant, less than 5% (due to DTDs).
  - Three costs: R<sub>L</sub>, R<sub>R</sub>, R<sub>I</sub>



<sup>&</sup>lt;sup>2</sup>D. Molka, D. Hackenberg, R. Schoene and M. S. Mueller. Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System. In Proc. 18th Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT'09), pages 261–270, Raleigh, NC, USA, 2009.

Intel Xeon Phi Architecture Communication Models

### Parametrization

- No significant difference among cached states (S, M, E).
- Distance between cores is nearly irrelevant, less than 5% (due to DTDs).
- Three costs: R<sub>L</sub>, R<sub>R</sub>, R<sub>I</sub>



Intel Xeon Phi Architecture Communication Models

### Parametrization

- No significant difference among cached states (S, M, E).
- Distance between cores is nearly irrelevant, less than 5% (due to DTDs).
- Three costs: R<sub>L</sub>, R<sub>R</sub>, R<sub>I</sub>



S. Ramos, T. Hoefler Modelin

Modeling Communication in Cache-Coherent SMP Systems

Intel Xeon Phi Architecture Communication Models

### Parametrization

- No significant difference among cached states (S, M, E).
- Distance between cores is nearly irrelevant, less than 5% (due to DTDs).
- Three costs: R<sub>L</sub>, R<sub>R</sub>, R<sub>I</sub>



Intel Xeon Phi Architecture Communication Models

### Parametrization

- No significant difference among cached states (S, M, E).
- Distance between cores is nearly irrelevant, less than 5% (due to DTDs).
- Three costs: R<sub>L</sub>, R<sub>R</sub>, R<sub>I</sub>



S. Ramos, T. Hoefler

Modeling Communication in Cache-Coherent SMP Systems

Intel Xeon Phi Architecture Communication Models

### Parametrization

- No significant difference among cached states (S, M, E).
- Distance between cores is nearly irrelevant, less than 5% (due to DTDs).
- Three costs: R<sub>L</sub>, R<sub>R</sub>, R<sub>I</sub>



Intel Xeon Phi Architecture Communication Models



Intel Xeon Phi Architecture Communication Models



Intel Xeon Phi Architecture Communication Models



Intel Xeon Phi Architecture Communication Models



Intel Xeon Phi Architecture Communication Models

### The Multi-line Ping-Pong Model



Intel Xeon Phi Architecture Communication Models

### **DTD** Contention Model



 $\mathcal{T}_{C}(n_{th}) = R_{L} + R_{R} + c \cdot (n_{th} - 1) = b + c \cdot n_{th}$ 

Intel Xeon Phi Architecture Communication Models

### DTD Contention Model



Intel Xeon Phi Architecture Communication Models

# **Ring Congestion**



- Benchmarking using several interleaved ping-pongs.
- Our results showed no congestion derived from having several communicating pairs accessing diferent memory addresses.

Small Broadcast Other Operations





Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations



Small Broadcast Other Operations

#### **Thread Interference**



### SOLUTION: Min-Max Models.

Small Broadcast Other Operations

#### **Small Broadcast**

- Receiver-driven approach
- Parameters: k<sub>1</sub> = 3, k<sub>2</sub> = 2, d = 2



S. Ramos, T. Hoefler

Modeling Communication in Cache-Coherent SMP Systems

Small Broadcast Other Operations

### Small Broadcast



Parameters: k<sub>1</sub> = 3, k<sub>2</sub> = 2, d = 2



Algorithms Evaluation

### Small Broadcast

- Receiver-driven approach
- Parameters: k<sub>1</sub> = 3, k<sub>2</sub> = 2, d = 2



Small Broadcast Other Operations

#### **Small Broadcast**

Receiver-driven approach

Parameters:  $k_1 = 3$ ,  $k_2 = 2$ , d = 2



$$\mathcal{T}_{tree} = \sum_{i=1}^{d} \mathcal{T}_{\mathcal{C}}(k_i) = \sum_{i=1}^{d} (k_i \cdot c + b)$$

Small Broadcast Other Operations

#### Small Broadcast









$$\begin{aligned} \mathcal{T}_{sbcast} &= \min_{d,k_i} \left( \mathcal{T}_{fw} + \mathcal{T}_{tree} + \sum_{i=1}^{d} \mathcal{T}_{nb}(k_i+1) \right) \\ N &\leq 1 + \sum_{i=1}^{d} \prod_{j=1}^{i} k_j, \ \forall i < j, k_i \leq k_j \end{aligned}$$

Small Broadcast Other Operations

# **Other Operations**

- Large Broadcast
  - Pipelined tree.
  - Flat Tree.
- Barrier Synchronization
  - Dissemination barrier.
- Small Reduction

**Discussion and Conclusions** 

#### Small Broadcast Performance



**Discussion and Conclusions** 

#### Small Broadcast Model



**Discussion and Conclusions** 

#### Large Broadcast Model



**Discussion and Conclusions** 

# Barrier Synchronization



**Discussion and Conclusions** 

## **Small Reduction**



Discussion and Conclusion Questions

- Optimizing for cache-coherency protocols is hard.
- Impact of interference caused by polling: min-max models. Is DRCA the solution?
- The model is able to guide algorithm design and development.
  - It does not provide a precise prediction but a range of possible performance.
  - The algorithms developed are up to 4.3 times faster than Intel MPI and OpenMP libraries.

Discussion and Conclusion Questions

#### **Questions?**

#### MODELING COMMUNICATION IN CACHE-COHERENT SMP SYSTEMS A Case Study with Xeon Phi

HPDC 2013

Sabela Ramos sramos@udc.es