What are Intel® Gaudi® 1, Intel® Gaudi® 2 and HPUs?


Intel Gaudi 1 and Intel Gaudi 2 are the first- and second-generation AI hardware accelerators designed by Habana Labs and Intel. A single server contains 8 devices, called Habana Processing Units (HPUs), each with 96GB of memory on Gaudi2 and 32GB on first-gen Gaudi. See the hardware documentation for more information about the underlying architecture.
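
As a quick illustration of the figures above, the following sketch computes the aggregate device memory of one server for each generation. The names `HPUS_PER_SERVER`, `MEMORY_GB_PER_HPU` and `server_memory_gb` are hypothetical helpers introduced here for illustration; only the numbers come from the text.

```python
# Figures taken from the paragraph above; all names are illustrative only.
HPUS_PER_SERVER = 8
MEMORY_GB_PER_HPU = {"gaudi1": 32, "gaudi2": 96}


def server_memory_gb(generation: str) -> int:
    """Total HPU memory (in GB) available on one 8-device server."""
    return HPUS_PER_SERVER * MEMORY_GB_PER_HPU[generation]


gaudi1_total = server_memory_gb("gaudi1")  # 8 * 32
gaudi2_total = server_memory_gb("gaudi2")  # 8 * 96
```

This is only the raw capacity: how much of it a given model can actually use depends on the workload and on how it is distributed across devices.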

The Habana SDK is called SynapseAI and is common to both first-gen Gaudi and Gaudi2. As a consequence, 🤗 Optimum Habana is fully compatible with both generations of accelerators.

Execution modes

Two execution modes are supported on HPUs for PyTorch, which is the main deep learning framework the 🤗 Transformers and 🤗 Diffusers libraries rely on:

  • Eager mode execution, where the framework executes one operation at a time as defined in standard PyTorch eager mode.
  • Lazy mode execution, where operations are accumulated internally in a graph. Execution of the accumulated graph is triggered lazily, only when the user requires a tensor value or when the script explicitly requests it. The SynapseAI graph compiler optimizes the execution of the accumulated operations (e.g. operator fusion, data layout management, parallelization, pipelining, memory management and other graph-level optimizations).

See the Optimum for Intel Gaudi documentation for how to use these execution modes.
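
To make the difference between the two modes concrete, here is a toy, pure-Python sketch of lazy execution. It is not SynapseAI and does not touch an HPU: the `LazyTensor` class is a hypothetical stand-in that only mimics the idea of accumulating operations and running them when a value is read.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LazyTensor:
    """Toy stand-in for a tensor whose value is computed on demand."""

    value: float
    pending: List[Callable[[float], float]] = field(default_factory=list)
    graph_runs: int = 0  # number of times the accumulated graph was executed

    def apply(self, op: Callable[[float], float]) -> "LazyTensor":
        # Lazy mode: record the operation instead of running it right away.
        self.pending.append(op)
        return self

    def item(self) -> float:
        # Reading the value is what triggers execution of the whole graph.
        if self.pending:
            for op in self.pending:
                self.value = op(self.value)
            self.pending.clear()
            self.graph_runs += 1
        return self.value


t = LazyTensor(2.0)
t.apply(lambda x: x + 1).apply(lambda x: x * 10)
queued = len(t.pending)  # two operations accumulated
untouched = t.value      # nothing has been executed yet
result = t.item()        # execution happens here: (2 + 1) * 10
```

In eager mode, each `apply` call would run its operation immediately. In this toy, the two operations run as one batch, which is the point at which a real graph compiler gets the chance to optimize them together.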

Distributed training

First-gen Gaudi and Gaudi2 are well-equipped for distributed training:

  • Scale-up to 8 devices on one server. See the documentation on single-node distributed training.
  • Scale-out to thousands of devices across several servers. See the documentation on multi-node training.
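
The relationship between scale-up and scale-out can be sketched with a little arithmetic. The snippet below assumes 8 HPUs per server, as stated above, and a node-major rank numbering, which is a common convention and an assumption here rather than something the text specifies. All function names are hypothetical.

```python
DEVICES_PER_NODE = 8  # from the text: one server holds 8 HPUs


def world_size(num_nodes: int) -> int:
    """Total number of devices taking part in training."""
    return num_nodes * DEVICES_PER_NODE


def global_rank(node_rank: int, local_rank: int) -> int:
    """Global index of a device, assuming node-major numbering."""
    if not 0 <= local_rank < DEVICES_PER_NODE:
        raise ValueError("local_rank must be in [0, DEVICES_PER_NODE)")
    return node_rank * DEVICES_PER_NODE + local_rank


def global_batch_size(per_device_batch: int, num_nodes: int, grad_accum: int = 1) -> int:
    """Samples consumed per optimizer step in data-parallel training."""
    return per_device_batch * world_size(num_nodes) * grad_accum


single_node = world_size(1)               # scale-up: one server
four_nodes = world_size(4)                # scale-out: four servers
rank = global_rank(node_rank=2, local_rank=3)
batch = global_batch_size(per_device_batch=16, num_nodes=4)
```

Keeping the global batch size in mind matters when moving from one node to several, since hyperparameters such as the learning rate are often tuned for a specific value.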

Inference

HPUs can also be used to perform inference:

  • Through HPU graphs, which are well suited to latency-sensitive applications. See the HPU graphs documentation for how to use them.
  • In lazy mode, which can be used the same way as for training.
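
As a minimal sketch of the first option, the helper below wraps a model in an HPU graph when the Intel Gaudi PyTorch bridge is importable and returns the model unchanged otherwise. The helper name `maybe_wrap_in_hpu_graph` is hypothetical. The sketch assumes the `habana_frameworks.torch` package and its `hpu.wrap_in_hpu_graph` function are available on the Gaudi machine, and that the model has already been moved to the HPU and put in evaluation mode.

```python
def maybe_wrap_in_hpu_graph(model):
    """Return `model` wrapped in an HPU graph if the Gaudi stack is present."""
    try:
        # Part of the Intel Gaudi PyTorch bridge; only installed on Gaudi machines.
        import habana_frameworks.torch as ht
    except ImportError:
        # No Gaudi software stack available: fall back to the unmodified model.
        return model
    # Assumption: `model` is a torch.nn.Module already placed on the HPU.
    return ht.hpu.wrap_in_hpu_graph(model)
```

On a machine without the Gaudi software stack, the function simply returns its argument, so the same script can be exercised elsewhere before being run on HPUs.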