Neuron Model Cache

The Neuron Model Cache is a remote cache for compiled Neuron models in the neff format. It is integrated into the NeuronTrainer and NeuronModelForCausalLM classes to enable loading pretrained models from the cache instead of compiling them locally.

Note: it is not available for models exported using any other NeuronModelXX classes, that use a different export mechanism.

The Neuron Model Cache is hosted on the Hugging Face Hub and includes compiled files for all popular and supported optimum-neuron pre-trained models.

Before training a Transformers or Diffusion model or loading a NeuronModelForCausalLM on Neuron platforms, it needs to be exported to neuron format with torch-neuronx.

When exporting a model, torch-neuronx will:

convert it to a set of XLA subgraphs,
compile each subgraph with the neuronx compiler into a Neuron Executable File Format (NEFF) binary file.

The first step is relatively fast, but the compilation takes a lot of time. To avoid recompiling all NEFF files every time a model is loaded on a NeuronX host, torch-neuronx stores NEFF files in a local directory, usually /var/tmp/neuron-compile-cache.

However, this local cache is not shared between platforms, which means that every time you train or export a model on a new host, you need to recompile it.

We created the Neuron Model Cache to solve this limitation by providing a public repository of precompiled model graphs.

Note: we also support the creation of private, secured, remote model cache.

How to use the Neuron model cache

The public model cache will be used when you use the NeuronTrainer or NeuronModelForCausalLM classes. There are no additional changes needed.

When exporting a model to neuron format, optimum-neuron will simply look for cached NEFF files in the hub repository during the compilation of the model subgraphs.

If the NEFF files are cached, they will be fetched from the hub and directly loaded instead of being recompiled.

How caching works

The Optimum Neuron Cache is built on top of the NeuronX compiler cache.

It is important to understand that the cache operates on NEFF binaries, and not on the model itself.

As explained previously, each model exported to Neuron using the NeuronTrainer or NeuronModelForCausalLM is composed of XLA subgraphs.

Each subgraph is unique, and results from the combination of:

the transformers or transformers_neuronx python modeling code,
the transformers model config,
the input_shapes selected during the export,
The precision of the model, full-precision, fp16 or bf16.

When compiling a subgraph to a NEFF file, other parameters influence the result:

The version of the Neuron X compiler,
The number of Neuron cores used,
The compilation parameters (such as the optimization level).

All these parameters are combined together to create a unique hash that identifies a NEFF file.

This has two very important consequences:

it is only when actually exporting a model that the associated NEFF files can be identified,
even a small change in the model configuration will lead to a different set of NEFF files.

It is therefore very difficult to know in advance if the NEFFs associated to a specific model configuration are cached.

Neuron model cache lookup (inferentia only)

The neuron cache lookup is a feature allowing users to look for compatible cached model configurations before exporting a model for inference.

It is based on a dedicated registry composed of stored cached configurations.

Cached model configurations are stored as entries under a specific subfolder in the Neuron Model Cache:

neuronxcc-2.12.54.0+f631c2365
├── 0_REGISTRY
│   └── 0.0.18
│       └── llama
│           └── meta-llama
│               └── Llama-2-7b-chat-hf
│                   └── 54c1f6689cd88f246fce.json

Each entry corresponds to the combination of a model configuration and its export parameters: this is as close as we can get to uniquely identify the exported model.

You can use the optimum-cli to lookup for compatible cached entries by passing it a hub model_id or the path to a file containing a model config.json.

$ optimum-cli neuron cache lookup meta-llama/Llama-2-7b-chat-hf

*** 1 entrie(s) found in cache for meta-llama/Llama-2-7b-chat-hf ***

task: text-generation
batch_size: 1
num_cores: 24
auto_cast_type: fp16
sequence_length: 2048
compiler_type: neuronx-cc
compiler_version: 2.12.54.0+f631c2365
checkpoint_id: meta-llama/Llama-2-7b-chat-hf
checkpoint_revision: c1b0db933684edbfe29a06fa47eb19cc48025e93

Note that even if compatible cached entries exist, this does not always guarantee that the model will not be recompiled during export if you modified the compilation parameters or updated the neuronx packages.

Advanced usage (trainium only)

How to use a private Neuron model cache (trainium only)

The repository for the public cache is aws-neuron/optimum-neuron-cache. This repository includes all precompiled files for commonly used models so that it is publicly available and free to use for everyone. But there are two limitations:

You will not be able to push your own compiled files on this repo
It is public and you might want to use a private repo for private models

To alleviate that you can create your own private cache repository using the optimum-cli or set the environment variable CUSTOM_CACHE_REPO.

Using the Optimum CLI

The Optimum CLI offers 2 subcommands for cache creation and setting:

create: To create a new cache repository that you can use as a private Neuron Model cache.
set: To set the name of the Neuron cache repository locally, the repository needs to exists and will be used by default by optimum-neuron.

Create a new Neuron cache repository:

optimum-cli neuron cache create --help

usage: optimum-cli neuron cache create [-h] [-n NAME] [--public]

optional arguments:
  -h, --help            show this help message and exit
  -n NAME, --name NAME  The name of the repo that will be used as a remote cache for the compilation files.
  --public              If set, the created repo will be public. By default the cache repo is private.

The -n / --name option allows you to specify a name for the Neuron cache repo, if not set the default name will be used. The --public flag allows you to make your Neuron cache public as it will be created as a private repository by default.

Example:

optimum-cli neuron cache create

Neuron cache created on the Hugging Face Hub: michaelbenayoun/optimum-neuron-cache [private].
Neuron cache name set locally to michaelbenayoun/optimum-neuron-cache in /home/michael/.cache/huggingface/optimum_neuron_custom_cache.

Set a different Trainiun cache repository:

usage: optimum-cli neuron cache set [-h] name

positional arguments:
  name        The name of the repo to use as remote cache.

optional arguments:
  -h, --help  show this help message and exit

Example:

optimum-cli neuron cache set michaelbenayoun/optimum-neuron-cache

Neuron cache name set locally to michaelbenayoun/optimum-neuron-cache in /home/michael/.cache/huggingface/optimum_neuron_custom_cache

The optimum-cli neuron cache set command is useful when working on a new instance to use your own cache.

Using the environment variable CUSTOM_CACHE_REPO

Using the CLI is not always feasible, and not very practical for small testing. In this case, you can simply set the environment variable CUSTOM_CACHE_REPO.

For example, if your cache repo is called michaelbenayoun/my_custom_cache_repo, you just need to do:

CUSTOM_CACHE_REPO="michaelbenayoun/my_custom_cache_repo" torchrun ...

or:

export CUSTOM_CACHE_REPO="michaelbenayoun/my_custom_cache_repo"
torchrun ...

You have to be logged into the Hugging Face Hub to be able to push and pull files from your private cache repository.

Cache system flow

Cache system flow

At each the beginning of each training step, the NeuronTrainer computes a NeuronHash and checks the cache repo(s) (official and custom) on the Hugging Face Hub to see if there are compiled files associated to this hash. If that is the case, the files are downloaded directly to the local cache directory and no compilation is needed. Otherwise compilation is performed.

Just as for downloading compiled files, the NeuronTrainer will keep track of the newly created compilation files at each training step, and upload them to the Hugging Face Hub at save time or when training ends. This assumes that you have writing access to the cache repo, otherwise nothing will be pushed.

Optimum CLI

The Optimum CLI can be used to perform various cache-related tasks, as described by the optimum-cli neuron cache command usage message:

usage: optimum-cli neuron cache [-h] {create,set,add,list} ...

positional arguments:
  {create,set,add,list,synchronize,lookup}
    create              Create a model repo on the Hugging Face Hub to store Neuron X compilation files.
    set                 Set the name of the Neuron cache repo to use locally (trainium only).
    add                 Add a model to the cache of your choice (trainium only).
    list                List models in a cache repo (trainium only).
    synchronize         Synchronize local compiler cache with the hub cache (inferentia only).
    lookup              Lookup the neuronx compiler hub cache for the specified model id (inferentia only).

optional arguments:
  -h, --help            show this help message and exit

Add a model to the cache (trainium only)

It is possible to add a model compilation files to a cache repo via the optimum-cli neuron cache add command:

usage: optimum-cli neuron cache add [-h] -m MODEL --task TASK --train_batch_size TRAIN_BATCH_SIZE [--eval_batch_size EVAL_BATCH_SIZE] [--sequence_length SEQUENCE_LENGTH]
                                    [--encoder_sequence_length ENCODER_SEQUENCE_LENGTH] [--decoder_sequence_length DECODER_SEQUENCE_LENGTH]
                                    [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS] --precision {fp,bf16} --num_cores
                                    {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32} [--max_steps MAX_STEPS]

When running this command a small training session will be run and the resulting compilation files will be pushed.

Make sure that the Neuron cache repo to use is set up locally, this can be done by running the `optimum-cli neuron cache set` command. You also need to make sure that you are logged in to the Hugging Face Hub and that you have the writing rights for the specified cache repo, this can be done via the `huggingface-cli login` command.

If at least one of those requirements is not met, the command will fail.

Example:

optimum-cli neuron cache add \
  --model prajjwal1/bert-tiny \
  --task text-classification \
  --train_batch_size 16 \
  --eval_batch_size 16 \
  --sequence_length 128 \
  --gradient_accumulation_steps 32 \
  --num_cores 32 \
  --precision bf16

This will push compilation files for the prajjwal1/bert-tiny model on the Neuron cache repo that was set up for the specified parameters.

List a cache repo

It can also be convenient to request the cache repo to know which compilation files are available. This can be done via the optimum-cli neuron cache list command:

usage: optimum-cli neuron cache list [-h] [-m MODEL] [-v VERSION] [name]

positional arguments:
  name                  The name of the repo to list. Will use the locally saved cache repo if left unspecified.

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        The model name or path of the model to consider. If left unspecified, will list all available models.
  -v VERSION, --version VERSION
                        The version of the Neuron X Compiler to consider. Will list all available versions if left unspecified.

As you can see, it is possible to:

List all the models available for all compiler versions.
List all the models available for a given compiler version by specifying the -v / --version argument.
List all compilation files for a given model, there can be many for different input shapes and so on, by specifying the -m / --model argument.

Example:

optimum-cli neuron cache list aws-neuron/optimum-neuron-cache

AWS Trainium & Inferentia

Neuron Model Cache

How to use the Neuron model cache

How caching works

Neuron model cache lookup (inferentia only)

Advanced usage (trainium only)

How to use a private Neuron model cache (trainium only)

Using the Optimum CLI

Using the environment variable CUSTOM_CACHE_REPO

Cache system flow

Optimum CLI

Add a model to the cache (trainium only)

List a cache repo