

### **T. HOEFLER**

## **Scalable and Efficient AI: From Supercomputers to Smartphones**

with contributions by the whole SPCL deep learning team (T. Ben-Nun, S. Li, K. Osawa, N. Dryden and many others), Microsoft Azure (M. Heddes, J. Belk, S. Scott, D. Goel, M. Castro) and collaborators (D. Alistarh and others)

Opening Keynote at the International Conference on Parallel Processing, 2023, Salt Lake City, UT, USA

#### Were you created with parallel processing?

Yes, I was created with parallel processing. Parallel processing is a computing technique when multiple streams of calculations or data processing tasks co-occur through numerous central processing units (CPUs) working concurrently<sup>1</sup>. This allows me to process information and generate responses quickly and efficiently. Is there anything else you would like to know about parallel processing? <sup>3</sup>

#### What is the best conference for parallel processing?

There are several conferences that focus on parallel processing. One of the most well-known is the **International Conference on Parallel Processing (ICPP)**<sup>1</sup>. It is one of the oldest continuously running computer science conferences in parallel computing in the world and is a premier forum for researchers, scientists, and practitioners in academia, industry, and government to present their latest research findings in all aspects of the field <sup>1</sup>.

#### Are there researchers working on large-scale AI models like yourself speaking at ICPP 2023?

Yes, there are researchers working on large-scale AI models speaking at the International Conference on Parallel Processing (ICPP) 2023<sup>1</sup>. One of the keynote speakers is **Torsten Hoefler**, who will be giving a talk on "Scalable and Efficient AI: From Supercomputers to Smartphones"<sup>1</sup>. In his talk, he will outline several techniques ranging from data ingestion, parallelization, to accelerator optimization that improve the efficiency of training systems for billion-parameter artificial intelligence models<sup>1</sup>. He will also discuss model compression and optimization techniques such as fine-grained sparsity as well as quantization to reduce model size and significantly improve efficiency during inference<sup>1</sup>. Is there anything else you would like to know?

#### Is he an entertaining speaker?

Based on his credentials and accomplishments, it is likely that he is an engaging and informative speaker. If you are interested in learning about the latest developments in scalable and efficient AI, then attending his talk at the International Conference on Parallel Processing (ICPP) 2023 would be a great opportunity.

M. Wolfe, 10 mins ago: "Compilers will never be smarter than a capable programmer" – let this stand the test of time








"Really the deciding factor [for the AI revolution] was the increase in compute power" (26:50) "I think a lot of the credit for deep learning goes to [... others ...] and the people who made the computers go fast." (27:00)

# 2018 ACM A.M. Turing Lecture

June 23, 2019 5:15pm MST



**Geoffrey Hinton** 

Yann LeCun

https://www.youtube.com/watch?v=VsnQf7exv5I



## How do we "Make Computers go Fast"?

# 2021 Turing Award – Jack Dongarra: The Take Away

Supercomputers are very (>70%) efficient at dense linear algebra!



- HPC Hardware is Constantly Changing
  - Scalar
  - Vector
  - Distributed
  - Accelerated
  - Mixed precision
- Three computer revolutions
  - High performance computing
  - Deep learning
  - Edge & AI
- Algorithm / Software advances follow hardware
  - And there is "plenty of room at the top"



"There's plenty of room at the Top: What will drive computer





## FINANCIAL TIMES


# The billion-dollar bet to reach human-level AI

OpenAI believes that huge computing power is key driver

In the race to build a machine with human-level intelligence, it seems, size really matters.

"We think the most benefits will go to whoever has the biggest computer," said Greg Brockman, chairman and chief technology officer of OpenAI.

The San Francisco-based AI research group, set up four years ago by tech industry luminaries including Elon Musk, Peter Thiel and Reid Hoffman, has just thrown down a challenge to the rest of the AI world.






## **Supercomputers fuel Modern AI**

## Facebook parent Meta creates powerful AI supercomputer

Facebook's parent company Meta says it has created what it believes is among the fastest artificial intelligence supercomputers running today

By The Associated Press, January 24, 2022, 10:33 PM

# Tesla unveils Dojo supercomputer: world's new most powerful AI training machine

Fred Lambert - Aug. 20th 2021 3:08 am PT, @FredericLambert

#### BABY STEPS Google artificial intelligence supercomputer creates its own 'AI child' that can outperform its human-made rivals

The NASNet system was created by a neural network called AutoML earlier this year

Mark Hodge, 15:22, 5 Dec 2017 | Updated: 11:27, 6 Dec 2017

## Microsoft invests \$1 billion in OpenAI to pursue holy grail of artificial intelligence

Building artificial general intelligence is OpenAI's ambitious goal

By James Vincent | Jul 22, 2019, 10:08am EDT





[Figure: network output f(x) as probabilities over dictionary words (not, sometimes, always, never, and, boat, house), with a layer-wise weight update]

- GPT-3: 500 billion tokens
- ImageNet (22k): A few TB
- Soon: the whole internet!

- GPT-3: 96 (complex) layers
  175 bn parameters (700 GiB in fp32)
  2048-token "sentences"
- GPT-3: 30-50k dictionaries
- takes weeks to train

T. Ben-Nun, TH: Demystifying parallel and distributed deep learning: An in-depth concurrency analysis, ACM Computing Surveys (CSUR), 2019







# Large-Scale AI is the Future

# We need a Principled Approach to it



## Three Systems Dimensions in Large-scale Super-learning ...





## High-Performance I/O

- Quickly growing data volumes
  - Scientific computing!
- Use the specifics of machine learning workloads
  - E.g., intelligent prefetching

CLAIRVOYANT PREFETCHING FOR DISTRIBUTED MACHINE LEARNING I/O

Roman Böhringer<sup>1</sup> Nikoli Dryden<sup>1</sup> Tal Ben-Nun<sup>1</sup> Torsten Hoefler<sup>1</sup>

#### ABSTRACT

I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments such as clouds and supercomputers. Optimal data ingestion pipelines differ between systems, and increasing efficiency requires a delicate balance between access to local storage, external filesystems, and remote workers; yet existing frameworks fail to efficiently utilize such resources. We observe that, given the seed generating the random access pattern for training with SGD, we have *clairvoyance* and can exactly predict when a given sample will be accessed. We combine this with a theoretical analysis of access patterns in training and performance modeling to produce a novel machine learning I/O middleware, HDMLP, to tackle the I/O bottleneck. HDMLP provides an easy-to-use, flexible, and scalable solution that delivers better performance than state-of-the-art approaches while requiring very few changes to existing codebases and supporting a broad range of environments.

### **High-Performance Compute**

- Deep learning is HPC
  - Data movement!
- Quantization, Sparsification
  - Drives modern accelerators!



### **High-Performance Communication**

- Use larger clusters (10k+ GPUs)
- Model parallelism
  - Complex pipeline schemes
- Optimized networks

## **Distribution and Parallelism**





## **High-Performance I/O for Deep Learning**


- Example: ResNet-50, 3.8 Gflop inference, $\approx$ 3x for training
  - ImageNet is 150 GiB for $\approx$1.3M images $\rightarrow$ average size 115 KiB, range: 508 B to 15 MiB
  - MLPerf v2.1 on one H100: 81k samples/s → 9.3 GiB/s random access → ~50 SSDs / GPU
  - Likely more for problems from scientific computing!
- Training on thousands of GPUs may need to manage 10,000s of SSDs



- But why do we need those even? Deep Learning workloads "randomly sample" input!
  - By "random", we really mean pseudo-random sequences with fixed seeds

This enables clairvoyant prefetching!
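A minimal sketch of why a fixed seed gives clairvoyance. It assumes a per-epoch shuffle driven by a seeded generator and a round-robin split of each epoch across workers; both choices, and the names `epoch_order` and `access_schedule`, are illustrative assumptions rather than the scheme NoPFS actually uses. Any process that knows the seed can replay the generator and compute when, and by which worker, each sample will be read.

```python
import random

def epoch_order(seed, epoch, num_samples):
    """Pseudo-random sample order for one epoch, reproducible from the seed."""
    rng = random.Random(seed * 1_000_003 + epoch)  # hypothetical per-epoch seeding
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

def access_schedule(seed, epochs, num_samples, num_workers):
    """Map each sample to the (epoch, step, worker) triples at which it is read.

    Assumes worker w reads positions w, w + num_workers, ... of each epoch order.
    """
    schedule = {sample: [] for sample in range(num_samples)}
    for epoch in range(epochs):
        for pos, sample in enumerate(epoch_order(seed, epoch, num_samples)):
            schedule[sample].append((epoch, pos // num_workers, pos % num_workers))
    return schedule

# Two independent replays of the same seed agree on every future access.
schedule_a = access_schedule(seed=42, epochs=3, num_samples=16, num_workers=4)
schedule_b = access_schedule(seed=42, epochs=3, num_samples=16, num_workers=4)
print(schedule_a == schedule_b)   # True: the access pattern is fully predictable
print(schedule_a[0])              # every future read of sample 0
```

With such a schedule in hand, a cache can decide ahead of time which samples to fetch next and which ones will not be needed again soon.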





# Clairvoyant Prefetching for Distributed Machine Learning I/O (arXiv 2101.08734)

NoPFS acts as a distributed cache – each node keeps cache – fully knowing about the future!



single-process access to samples for ImageNet with 16 processes










ImageNet 1k with ResNet-50

@spcl @spcl\_eth ETH Zürich




## runtime per epoch (full training time)



ImageNet 1k with ResNet-50



## Three Systems Dimensions in Large-scale Super-learning ...





**High-Performance I/O** 

- Quickly growing data volumes
  - Scientific computing!
- Use the specifics of machine learning workloads
  - E.g., intelligent prefetching


### **High-Performance Compute**

- Deep learning is HPC
  - Data movement!
- Quantization, Sparsification
  - Drives modern accelerators!



### **High-Performance Communication**

- Use larger clusters (10k+ GPUs)
- Model parallelism
  - Complex pipeline schemes
- Optimized networks

## **Distribution and Parallelism**





## Data Movement Is All You Need: A Case Study on Optimizing Transformers (arXiv:2007.00072)



OpenAI booth at NeurIPS 2019 in Vancouver, Canada. Image Credit: Khari Johnson / VentureBeat

Last week, OpenAI published a paper <u>detailing</u> GPT-3, a machine learning model that achieves strong results on a number of natural language benchmarks. At 175 billion parameters, where a parameter affects data's prominence in an overall prediction, it's the largest of its kind. And with a memory size exceeding 350GB, it's one of the priciest, costing an estimated \$12 million to train.

| Operator class                        | % flop | % Runtime |
|---------------------------------------|--------|-----------|
| Tensor contraction                    | 99.80  | 61.0      |
| Statistical normalization             | 0.17   | 25.5      |
| Element-wise                          | 0.03   | 13.5      |
| Normalization + element-wise combined | 0.20   | 39.0      |

## **Our performance improvement for BERT-large**

- 30% over PyTorch
- 20% over TensorFlow + XLA
- 8% over DeepSpeed

est. savings on AWS over PyTorch: \$85k for BERT, \$3.6M GPT-3












# Moving Data is Most Expensive!

# **Techniques to Shrink ML Data**



# Quantization – Running Gigantic LLMs on Reasonable Systems (arXiv:2210.17323)

- Brains have limited precision! Why are we computing with FP32?
  - For technical reasons (SGD, optimization, how we quantize)
  - Neurons in Hippocampus can "reliably distinguish 24 strengths" [1]
    4.6 bits of information!
- GPT-3 has up to 175 billion parameters
  - 700 GiB in FP32, 350 GiB in FP16/BF16
  - Rounding to <5 bits is not so simple
  - Requires some foundation and many tricks
- Consider "error landscape" of a trained model with weights w [2]
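The sizes quoted above follow from parameters × bits / 8. The sketch below reproduces that arithmetic for the 175 billion parameters named on the slide; the helper name `weight_bytes` is an assumption made for illustration, and only the weights are counted (no activations, optimizer state, or quantization metadata).

```python
def weight_bytes(num_params, bits_per_param):
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_param / 8

GPT3_PARAMS = 175_000_000_000  # "up to 175 billion parameters" from the slide

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("4-bit", 4), ("3-bit", 3)]:
    size = weight_bytes(GPT3_PARAMS, bits)
    # Report both decimal gigabytes (1e9 bytes) and binary gibibytes (2**30 bytes).
    print(f"{name:10s} {size / 1e9:7.1f} GB  {size / 2**30:7.1f} GiB")
```

Running it suggests that the 700 and 350 figures correspond to units of 10^9 bytes; expressed in binary GiB the FP32 weights come to roughly 652.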












## Quantization – Running Gigantic LLMs on Reasonable Systems (arXiv:2210.17323)

- Quantization objective for low-precision rounded weights $\hat{w}$: $\operatorname{argmin}_{\hat{w}} \lVert wx - \hat{w}x \rVert^2$
- Solve PTQ optimization problem row by row of w
  - Round row and push the error forward using the inverse Hessian
  - Update Hessian for each column
- Tricks
  - Block updates for better locality (10x speedup)
  - Use Cholesky to invert Hessian (higher stability)
  - Work one transformer block at a time (6 operators fit in memory)
  - Use quantized input from previous blocks for block i
- Results
  - Generative inference 2-4x faster
  - 3 bits → 66 GiB, fits in a single (high-end) A100 GPU!

| Model    | FP16 | 1024  | 512   | 256   | 128  | 64   | 32   | 3-bit |
|----------|------|-------|-------|-------|------|------|------|-------|
| OPT-175B | 8.34 | 11.84 | 10.85 | 10.00 | 9.58 | 9.18 | 8.94 | 8.68  |
| BLOOM    | 8.11 | 11.80 | 10.84 | 10.13 | 9.55 | 9.17 | 8.83 | 8.64  |

Table 6: 2-bit GPTQ quantization results with varying group-sizes; perplexity on WikiText2.



Figure 1: Quantizing OPT models to 4 and BLOOM models to 3 bit precision, comparing GPTQ with the FP16 baseline and round-to-nearest (RTN) [34, 5].







# Quantization Reduces Data by an Order of Magnitude

# How to Go Further?



## Model Sparsification ... (arXiv:2102.00554)

## Brains are not densely connected! Why are DNN computations dense?

- For technical reasons (training, implementation etc.)
- We may want to shift towards sparse!

Intuition: not all features are always relevant!

- Represent as (sparse) vector space
- Less overfitting
- Interpretability
- Parsimony

the f\_t\_re wi\_l b\_ sp\_rs\_

Key results:

- 95% sparse ResNet-52,
  BERT, or GPT models
- Essentially same quality
- Up to 20x cheaper!



## Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks



#### 1 INTRODUCTION

Deep learning shows unparalleled promise for solving very complex real-world problems in areas such as computer vision, natural language processing, knowledge representation, recommendation systems, drug discovery, and many more. With this development, the field of machine learning is moving from traditional feature engineering to neural architecture engineering. However, still

Hoefler et al. "Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks", arXiv 2102.00554, Jan 2021




## **Sparse ML Computations – Very Different from Scientific Computing!**





## Programming Sparse Models – Meet PyTorch Sten (arXiv:2304.07613)



### Selected Available Sparsifiers:








## **Sten Performance**










# Model Compression Enables

# **More Efficient Processing**

# Which Makes Data Movement Even More Important!

# **Especially in the Network!**



## Three Systems Dimensions in Large-scale Super-learning ...





**High-Performance I/O** 

- Quickly growing data volumes
  - Scientific computing!
- Use the specifics of machine learning workloads
  - E.g., intelligent prefetching

### **High-Performance Compute**

- Deep learning is HPC
  - Data movement!
- Quantization, Sparsification
  - Drives modern accelerators!



#### **High-Performance Communication**

- Use larger clusters (10k+ GPUs)
- Model parallelism
  - Complex pipeline schemes
- Optimized networks

## **Distribution and Parallelism**





## The Three Dimensions of Parallelism in Deep Learning (arXiv:1802.09941)






# **Data-parallel** Gradient Sparsification – Top-k SGD (arXiv:1809.10505)

- Turns out 90-99.9% of the smallest gradient values can be skipped in the summation at similar accuracy
  - Accumulate the skipped values locally (convergence proof, similar to async. SGD with implicit staleness bounds [1])
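A minimal sketch of the two bullets above for a single worker: only the k largest-magnitude entries are sent, and everything that was skipped is kept in a local residual and added back before the next selection. The function name `topk_sparsify` and the toy gradients are assumptions for illustration; the cross-worker summation is not shown.

```python
def topk_sparsify(grad, residual, k):
    """Return (sparse update as {index: value}, new local residual)."""
    acc = [g + r for g, r in zip(grad, residual)]          # add back skipped values
    keep = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    sparse = {i: acc[i] for i in keep}                     # what gets communicated
    new_residual = [0.0 if i in sparse else a for i, a in enumerate(acc)]
    return sparse, new_residual

residual = [0.0, 0.0, 0.0, 0.0]
sparse1, residual = topk_sparsify([0.1, -2.0, 0.05, 0.7], residual, k=1)
sparse2, residual = topk_sparsify([0.1, 0.0, 0.0, 0.7], residual, k=1)
print(sparse1)   # index 1 is sent first
print(sparse2)   # index 3 is sent next, carrying its accumulated value
print(residual)  # small entries keep accumulating locally
```

Nothing is dropped for good: a value that is too small to be sent in one step still contributes once its accumulated magnitude becomes large enough.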





# SparCML – Sparse Allreduce for Decentral Updates (arXiv:1802.08021)









## Microsoft Speech Production Workload Results – 2 weeks → 2 days!

| System             | Dataset  | Model   | # of nodes | Algorithm            | Speedup                      |
|--------------------|----------|---------|------------|----------------------|------------------------------|
| Piz Daint          | ImageNet | VGG19   | 8          | Q4                   | 1.55 (3.31)                  |
| Piz Daint          | ImageNet | AlexNet | 16         | Q4                   | 1.30 (1.36)                  |
| Piz Daint          | MNIST    | MLP     | 8          | Top16_Q4             | 3.65 (4.53)                  |
| EC2                | MNIST    | MLP     | 8          | Top16_Q4             | 19.12 (22.97)                |



## **Sparse Allreduce – A Headache for Systems Work**

## Flare: Flexible In-Network Allreduce

Salvatore Di Girolamo (salvatore.digirolamo@inf.ethz.ch), Daniele De Sensi (daniele.desensi@inf.ethz.ch), Shigang Li (shigang.li@inf.ethz.ch), ETH Zurich, Zurich, Switzerland

#### ABSTRACT

The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, [...] the data received from the hosts, and send [...] result. [...]

#### CCS CONCEPTS

• Networks → In-network processing; • Hardware → Networking hardware; • Computer systems organization → Distributed architectures.

#### **KEYWORDS**

In-Network Computing; Programmable Switch; Allreduce

#### ACM Reference Format:

Daniele De Sensi, Salvatore Di Girolamo, Saleh Ashkboos, Shigang Li, and Torsten Hoefler. 2018. Flare: Flexible In-Network Allreduce. In Supercomputing '21: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov 14-19, 2021, St. Louis, MO. ACM, New

Saleh Ashkboos (saleh.ashkboos@inf.ethz.ch), Torsten Hoefler (torsten.hoefler@inf.ethz.ch), ETH Zurich, Zurich, Switzerland

> [...] MPI\_Allreduce is the [...] is the Raben[...]. This algorithm [...] allgather phase. [...] each of these two [...] messages, each [...] reduced). The [...] then $2(P-1)\frac{Z}{P} \approx 2Z$. [...] transmitted data, and thus increase the [...] can exploit in-network compute, i.e., they can [...] allreduce operation to the switches in the network.

To outline the advantages of performing an in-network allreduce, we describe the general idea underlying most existing in-network reduction approaches [9-11]. We first suppose to have the *P* hosts connected through a single switch. Each of the hosts sends its data to the switch, that aggregates together the vectors coming from all the hosts, and then sends them back the aggregated vector. Differently from the host-based optimal allreduce, in the in-network allreduce each host only sends Z elements, thus leading to a 2x reduction in the amount of transmitted data. If the switches can aggregate the received data at line rate, this leads to a 2x bandwidth improvement compared to a host-based allreduce. Besides improvements in the bandwidth, in-network allreduce also reduces the network traffic. Because the interconnection network consumes a large fraction of the overall system power (from 15% to 50% depending on the system load [12]), any reduction in the network traffic would also help in reducing the power consumption and thus the running cost of the system.

#### Near-Optimal Sparse Allreduce for Distributed Deep Learning

Shigang Li shigang.li@inf.ethz.ch Department of Computer Science, ETH Zurich Switzerland

#### Abstract

Communication overhead is one of the major obstacles to train large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving a scalable and efficient sparse *allreduce* algorithm and (2) the sparsification overhead. This paper proposes Ok-Topk, [...] scheme for distributed training with spa[...]. Ok-Topk integrates a novel sparse all[...] and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).

CCS Concepts: • Theory of computation  $\rightarrow$  Parallel algorithms; • Computing methodologies  $\rightarrow$  Neural networks.

Keywords: distributed deep learning, allreduce, gradient sparsification, data parallelism

Torsten Hoefler htor@inf.ethz.ch Department of Computer Science, ETH Zurich Switzerland

introducing up to 99.9% zero values without significant loss of accuracy. Only the nonzero values of the distributed gradients are accumulated across all processes. See [22] for an overview of gradient and other sparsification approaches in

[...] suffer from [...], which also leads to a [...] data volume as P grows, and may [...] representations on the fly. For example, let us assume the model has 1 million weights and it is 99% sparse at each node; thus, each node contributes its 10,000 largest gradient values and their indexes to the calculation. Let us now assume that the computation is distributed across 128 data-parallel nodes and the reduction uses a dissemination algorithm [20, 28] with 7 stages. In stage one, each process communicates its 10,000 values to be summed up. Each process now enters the next stage with up to 20,000 values. Those again are summed up leading to up to 40,000 values in stage 3 (if the value indexes do not overlap). The number of values grows exponentially until the algorithm converges after 7 stages with 640,000 values (nearly dense!). Even with overlapping indexes, the fill-in will quickly diminish the benefits of gradient sparsity in practice and lead to large and
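The fill-in effect described in this excerpt can be simulated in a few lines. The sketch below assumes that every node's nonzero positions are drawn uniformly at random (real gradients may overlap more) and uses the fact that, in a dissemination pattern, a process holds the union of 2^s consecutive ranks' index sets after stage s. The function name `fill_in` and the seed are illustrative assumptions.

```python
import random

def fill_in(num_weights, nnz_per_node, num_nodes, seed=0):
    """Nonzeros held by one process initially and after each dissemination stage."""
    rng = random.Random(seed)
    index_sets = [set(rng.sample(range(num_weights), nnz_per_node))
                  for _ in range(num_nodes)]
    sizes, merged, covered, span = [], set(), 0, 1
    while span <= num_nodes:
        for idx in index_sets[covered:span]:   # merge the newly received ranks
            merged |= idx
        covered = span
        sizes.append(len(merged))
        span *= 2
    return sizes

# Parameters from the example: 1 million weights, 99% sparse, 128 nodes.
sizes = fill_in(num_weights=1_000_000, nnz_per_node=10_000, num_nodes=128)
for stage, size in enumerate(sizes):
    worst_case = min(1_000_000, 10_000 * 2 ** stage)   # no overlapping indexes
    print(f"after stage {stage}: {size} nonzeros (no-overlap bound {worst_case})")
```

Under the uniform-random assumption the result after seven stages is on the order of 70% of the dense size, which illustrates why the excerpt calls the outcome nearly dense even when indexes overlap.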








## Bidirectional Pipelines – Meet Chimera (arXiv: 2107.06925v3)





S. Li, T. Hoefler: Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines, best paper candidate at Supercomputing, SC21



## Chimera Weak Scaling (arXiv: 2107.06925v3)





- 1.38x–2.34x speedup over synchronous approaches (GPipe, GEMS, DAPPLE)
  - Fewer bubbles
  - More balanced memory, thus no recomputation
- 1.16x–2.01x speedup over asynchronous approaches (PipeDream-2BW, PipeDream)
  - More balanced memory, thus no recomputation
  - Gradient accumulation, thus low synchronization frequency








# Operator Parallelism, i.e., Parallel Matrix Matrix Multiplication

Remember those?

- Large MMMs dominate large language models!
  - e.g., GPT-3 multiplies 12,288x12,288 matrices
    600 MiB in fp32 and 1.9 Tflop
  - generative inference multiplies tall & skinny matrices
- Distribute as operator parallelism
  - Heaviest communication dimension!
    Requires most optimization!
- COSMA [1] communication-optimal distributed MMM
  - Achieves tight I/O lower bound of  $Q \ge \min\left\{\frac{2mnk}{p\sqrt{S}} + S, 3\left(\frac{mnk}{p}\right)^{\frac{2}{3}}\right\}$
  - Uses partial replication with an outer-product schedule
  - See paper for details and proofs!
- AutoDDL [2] combines operator-parallel models into communication-avoiding data distribution
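The numbers in this list can be checked with a short sketch. It assumes that p is the number of processors and S the local memory per processor in words (the slide does not define them), and the function name `io_bound_terms` is hypothetical. The code evaluates the two terms inside the min separately and also reproduces the size and work of one 12,288 × 12,288 product.

```python
import math

def io_bound_terms(m, n, k, p, S):
    """Return the two terms inside the min of the I/O bound on the slide.

    Assumes p processors, each with a local memory of S words.
    """
    memory_dependent = 2 * m * n * k / (p * math.sqrt(S)) + S
    memory_independent = 3 * (m * n * k / p) ** (2 / 3)
    return memory_dependent, memory_independent

n = 12_288
matrix_bytes = n * n * 4     # one fp32 matrix
multiply_adds = n ** 3       # scalar multiply-add pairs in one n x n x n product
print(matrix_bytes, multiply_adds)

# Example: distribute the product over 1024 processors with 2**20 words each.
dep, indep = io_bound_terms(n, n, n, p=1024, S=2 ** 20)
print(dep, indep, min(dep, indep))
```

One fp32 matrix comes to about 604 million bytes and the product to about 1.86 × 10^12 multiply-add pairs, in line with the rounded figures above. The two terms coincide at S = (mnk/p)^(2/3); for smaller S the memory-dependent term is the larger of the two.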

[1] G. Kwasniewski et al.: "Red-Blue Pebbling Revisited: Near Optimal Parallel Matrix-Matrix Multiplication", best student paper at Supercomputing SC19

[2] J. Chen et al.: "AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication", arXiv

| Operator class                | % flop | % Runtime |
|-------------------------------|--------|-----------|
| Tensor contraction (all MMM!) | 99.80  | 61.0      |
| Statistical normalization     | 0.17   | 25.5      |
| Element-wise                  | 0.03   | 13.5      |










## Communications in 3D Parallelism in Deep Learning (arXiv:2209.01346)



TH et al.: HammingMesh: A Network Topology for Large-Scale Deep Learning, to appear at SC22 and arXiv (2209.01346)


## **Co-designing an AI Supercomputer with Unprecedented and Cheap Bandwidth**





## Bandwidth-cost-flexibility Tradeoffs (arXiv:2209.01346)

[Figure labels: **Global Topology** (e.g., Fat Tree); HammingMesh (many configurations); Local Topology (e.g., 2D Torus)]



## Three Systems Dimensions in Large-scale Super-learning ...



**High-Performance I/O**

- Quickly growing data volumes
  - Scientific computing!
- Use the specifics of machine learning workloads
  - E.g., intelligent prefetching

**High-Performance Compute**

- Deep learning is HPC
  - Data movement!
- Quantization, Sparsification
  - Drives modern accelerators!

**High-Performance Communication**

- Use larger clusters (10k+ GPUs)
- Model parallelism
  - Complex pipeline schemes
- Optimized networks

**Distribution and Parallelism**

## What will the (near) future bring?

Some predictions for the future of HPC but also computing at large!







## **Prediction 1: Accelerators Converge**

## AI is a gravity well – HPC will follow

## **Future Accelerators ...**

- Most of the performance will be low precision arithmetic!
  - I would predict (C)FP8 or smaller
  - We can be lucky if we get some fp64!
- They will support quantization and sparsity in hardware
  - Vector scaling and zero points
- They will heavily be optimized towards data movement
  - Physical limits and cost introduce two fundamental constraints:
    - Latency will become a problem
    - Locality and sparse connectivity
  - Potentially hard to program



B. Wisniewski (Samsung) **Memory-coupled Compute** SPCL\_Bcast 01/19/23 <u>https://www.youtube.com/watch?v=KCrQtpx31CQ</u>








Optimized topologies and network technologies. E.g., HammingMesh <u>https://www.youtube.com/watch?v=xxwT45ljG4o</u>



## **Sparse-Quantized Representations - SpQR**

#### SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

| Author               | Affiliation               |
|----------------------|---------------------------|
| Tim Dettmers*†       | University of Washington  |
| Ruslan Svirschevski* | HSE University & Yandex   |
| Vage Egiazarian*     | HSE University & Yandex   |
| Denis Kuznedelev*    | Yandex & Skoltech         |
| Elias Frantar        | IST Austria               |
| Saleh Ashkboos       | ETH Zurich                |
| Alexander Borzunov   | HSE University & Yandex   |
| Torsten Hoefler      | ETH Zurich                |
| Dan Alistarh         | IST Austria & NeuralMagic |
#### Abstract

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especially for smaller models in the 1-10B parameter range, which are well-suited for edge deployments. To address this accuracy issue, we introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique which enables for the first time near-lossless compression of LLMs across model scales, while reaching similar compression levels to previous methods. SpQR works by identifying and isolating outlier weights, which cause particularly large quantization errors, and storing them in higher precision, while compressing all other weights to 3-4 bits, and achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs. This makes it possible to run 33B parameter LLM on a single 24 GB consumer GPU without any performance degradation at 15% speedup thus making powerful LLMs available to consumer without any downsides. SpQR comes with efficient algorithms for both encoding weights into its format, as well as decoding them efficiently at runtime<sup>3</sup>. Specifically, we provide an efficient GPU inference algorithm for SpQR which yields faster inference than 16-bit baselines at similar accuracy, while enabling memory compression gains of more than 4x.










## **Prediction 2: Programming and Tools Converge**

## Data Science as a gravity well – HPC will follow






### Upleveling Programming in the 21<sup>st</sup> Century – Performance Metaprogramming




Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs, SC19







## **Prediction 3: Networks Converge**

## Cloud as a gravity well – HPC will follow




COVER FEATURE **TECHNOLOGY PREDICTIONS** 

## The Convergence of Hyperscale Data Center and High-Performance Computing Networks

Torsten Hoefler, ETH Zurich; Ariel Hendel, Scala Computing; Duncan Roweth, Hewlett Packard Enterprise

We discuss the differences and commonalities between network technologies used in supercomputers and data centers and outline a path to convergence at multiple layers. We predict that emerging smart networking solutions will accelerate that convergence.






## Ultra Ethernet Set Out to Create the Best AI/ML and HPC Interconnect!

COVER FEATURE **TECHNOLOGY PREDICTIONS**

## Data Center Ethernet and Remote Direct Memory Access: Issues at Hyperscale

Torsten Hoefler, ETH Zürich; Duncan Roweth, Keith Underwood, and Robert Alverson, Hewlett Packard Enterprise; Mark Griswold, Vahid Tabatabaee, Mohan Kalkunte, and Surendra Anubolu, Broadcom; Siyuan Shen, ETH Zürich; Moray McLaren, Google; Abdul Kabbani and Steve Scott, Microsoft

# Ultra Ethernet



White paper on <u>ultraethernet.org</u>

Overview of and Motivation for the Forthcoming Ultra Ethernet Consortium Specification

### Networking Demands of Modern AI Jobs

Networking is increasingly important for efficient and cost-effective training of AI models. Large Language Models (LLMs) such as GPT-3, Chinchilla, and PALM, as well as recommendation systems like DLRM and DHEN, are trained on clusters of thousands of GPUs.



### **Key Points and Conclusions**

### More of SPCL's research:











Want to join our efforts? We're looking for excellent Postdocs, PhD students, and Visitors. Talk to me!

