Optimum documentation

GaudiTrainer

Optimum

Overview

🤗 Optimum Installation Quick tour Notebooks

Conceptual guides

Nvidia

AMD

Intel

AWS Trainium/Inferentia

Google TPUs

Habana

🤗 Optimum Habana Installation Quickstart

Tutorials

How-To Guides

Conceptual Guides

Reference

Gaudi Trainer Gaudi Configuration Gaudi Stable Diffusion Pipeline Distributed Runner

Furiosa

ONNX Runtime

Exporters

BetterTransformer

Torch FX

LLM quantization

Utilities

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

GaudiTrainer

The GaudiTrainer class provides an extended API for the feature-complete Transformers Trainer. It is used in all the example scripts.

Before instantiating your GaudiTrainer, create a GaudiTrainingArguments object to access all the points of customization during training.

The GaudiTrainer class is optimized for 🤗 Transformers models running on Habana Gaudi.

Here is an example of how to customize GaudiTrainer to use a weighted loss (useful when you have an unbalanced training set):

from torch import nn
from optimum.habana import GaudiTrainer


class CustomGaudiTrainer(GaudiTrainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss (suppose one has 3 labels with different weights)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0]))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

Another way to customize the training loop behavior for the PyTorch GaudiTrainer is to use callbacks that can inspect the training loop state (for progress reporting, logging on TensorBoard or other ML platforms…) and take decisions (like early stopping).

GaudiTrainer

class optimum.habana.GaudiTrainer

( model: Union = None gaudi_config: GaudiConfig = None args: TrainingArguments = None data_collator: Optional = None train_dataset: Union = None eval_dataset: Union = None tokenizer: Optional = None model_init: Optional = None compute_metrics: Optional = None callbacks: Optional = None optimizers: Tuple = (None, None) preprocess_logits_for_metrics: Optional = None )

GaudiTrainer is built on top of the tranformers’ Trainer to enable deployment on Habana’s Gaudi.

autocast_smart_context_manager

( cache_enabled: Optional = True )

A helper wrapper that creates an appropriate context manager for autocast while feeding it the desired arguments, depending on the situation. Modified by Habana to enable using autocast on Gaudi devices.

evaluate

( eval_dataset: Union = None ignore_keys: Optional = None metric_key_prefix: str = 'eval' )

From https://github.com/huggingface/transformers/blob/v4.38.2/src/transformers/trainer.py#L3162 with the following modification

comment out TPU related
use throughput_warmup_steps in evaluation throughput calculation

evaluation_loop

( dataloader: DataLoader description: str prediction_loss_only: Optional = None ignore_keys: Optional = None metric_key_prefix: str = 'eval' )

Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict(). Works both with or without labels.

log

( logs: Dict )

Parameters

logs (Dict[str, float]) — The values to log.

Log logs on the various objects watching training. Subclass and override this method to inject custom behavior.

prediction_loop

( dataloader: DataLoader description: str prediction_loss_only: Optional = None ignore_keys: Optional = None metric_key_prefix: str = 'eval' )

Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict(). Works both with or without labels.

prediction_step

( model: Module inputs: Dict prediction_loss_only: bool ignore_keys: Optional = None ) → Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]

Parameters

model (torch.nn.Module) — The model to evaluate.
inputs (Dict[str, Union[torch.Tensor, Any]]) — The inputs and targets of the model. The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.
prediction_loss_only (bool) — Whether or not to return the loss only.
ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.

Returns

Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]

A tuple with the loss, logits and labels (each being optional).

Perform an evaluation step on model using inputs. Subclass and override to inject custom behavior.

save_model

( output_dir: Optional = None _internal_call: bool = False )

Will save the model, so you can reload it using from_pretrained(). Will only save from the main process.

train

( resume_from_checkpoint: Union = None trial: Union = None ignore_keys_for_eval: Optional = None **kwargs )

Parameters

resume_from_checkpoint (str or bool, optional) — If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.
trial (optuna.Trial or Dict[str, Any], optional) — The trial run or the hyperparameter dictionary for hyperparameter search.
ignore_keys_for_eval (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during the training.
kwargs (Dict[str, Any], optional) — Additional keyword arguments used to hide deprecated arguments

Main training entry point.

training_step

( model: Module inputs: Dict ) → torch.Tensor

Parameters

model (torch.nn.Module) — The model to train.
inputs (Dict[str, Union[torch.Tensor, Any]]) — The inputs and targets of the model.

The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.

Returns

torch.Tensor

The tensor with training loss on this batch.

Perform a training step on a batch of inputs.

Subclass and override to inject custom behavior.

GaudiSeq2SeqTrainer

class optimum.habana.GaudiSeq2SeqTrainer

( model: Union = None gaudi_config: GaudiConfig = None args: GaudiTrainingArguments = None data_collator: Optional = None train_dataset: Optional = None eval_dataset: Union = None tokenizer: Optional = None model_init: Optional = None compute_metrics: Optional = None callbacks: Optional = None optimizers: Tuple = (None, None) preprocess_logits_for_metrics: Optional = None )

evaluate

( eval_dataset: Optional = None ignore_keys: Optional = None metric_key_prefix: str = 'eval' **gen_kwargs )

Parameters

eval_dataset (Dataset, optional) — Pass a dataset if you wish to override self.eval_dataset. If it is an Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement the __len__ method.
ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.
metric_key_prefix (str, optional, defaults to "eval") — An optional prefix to be used as the metrics key prefix. For example the metrics “bleu” will be named “eval_bleu” if the prefix is "eval" (default)
max_length (int, optional) — The maximum target length to use when predicting with the generate method.
num_beams (int, optional) — Number of beams for beam search that will be used when predicting with the generate method. 1 means no beam search. gen_kwargs — Additional generate specific kwargs.

Run evaluation and returns metrics. The calling script will be responsible for providing a method to compute metrics, as they are task-dependent (pass it to the init compute_metrics argument). You can also subclass and override this method to inject custom behavior.

predict

( test_dataset: Dataset ignore_keys: Optional = None metric_key_prefix: str = 'test' **gen_kwargs )

Parameters

test_dataset (Dataset) — Dataset to run the predictions on. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. Has to implement the method __len__
ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.
metric_key_prefix (str, optional, defaults to "eval") — An optional prefix to be used as the metrics key prefix. For example the metrics “bleu” will be named “eval_bleu” if the prefix is "eval" (default)
max_length (int, optional) — The maximum target length to use when predicting with the generate method.
num_beams (int, optional) — Number of beams for beam search that will be used when predicting with the generate method. 1 means no beam search. gen_kwargs — Additional generate specific kwargs.

Run prediction and returns predictions and potential metrics. Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

If your predictions or labels have different sequence lengths (for instance because you're doing dynamic padding in a token classification task) the predictions will be padded (on the right) to allow for concatenation into one array. The padding index is -100.

Returns: *NamedTuple* A namedtuple with the following keys: - predictions (`np.ndarray`): The predictions on `test_dataset`. - label_ids (`np.ndarray`, *optional*): The labels (if the dataset contained some). - metrics (`Dict[str, float]`, *optional*): The potential dictionary of metrics (if the dataset contained labels).

GaudiTrainingArguments

class optimum.habana.GaudiTrainingArguments

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: Union = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: Optional = None per_gpu_eval_batch_size: Optional = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: Optional = None eval_delay: Optional = 0 torch_empty_cache_steps: Optional = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: Optional = 1e-06 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: Union = 'linear' lr_scheduler_kwargs: Union = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: Optional = 'passive' log_level_replica: Optional = 'warning' log_on_each_node: bool = True logging_dir: Optional = None logging_strategy: Union = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: Optional = False save_strategy: Union = 'steps' save_steps: float = 500 save_total_limit: Optional = None save_safetensors: Optional = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: Optional = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'hpu_amp' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: Optional = None local_rank: int = -1 ddp_backend: Optional = 'hccl' tpu_num_cores: Optional = None tpu_metrics_debug: bool = False debug: Union = '' dataloader_drop_last: bool = False eval_steps: Optional = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: Optional = None past_index: int = -1 run_name: Optional = None disable_tqdm: Optional = None remove_unused_columns: Optional = True label_names: Optional = None load_best_model_at_end: Optional = False metric_for_best_model: Optional = None greater_is_better: Optional = None ignore_data_skip: bool = False fsdp: Union = '' fsdp_min_num_params: int = 0 fsdp_config: Union = None fsdp_transformer_layer_cls_to_wrap: Optional = None accelerator_config: Union = None deepspeed: Union = None label_smoothing_factor: float = 0.0 optim: Union = 'adamw_torch' optim_args: Optional = None adafactor: bool = False group_by_length: bool = False length_column_name: Optional = 'length' report_to: Union = None ddp_find_unused_parameters: Optional = False ddp_bucket_cap_mb: Optional = 230 ddp_broadcast_buffers: Optional = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: Optional = None hub_model_id: Optional = None hub_strategy: Union = 'every_save' hub_token: Optional = None hub_private_repo: bool = False hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: Union = None include_inputs_for_metrics: bool = False eval_do_concat_batches: bool = True fp16_backend: str = 'auto' evaluation_strategy: Union = None push_to_hub_model_id: Optional = None push_to_hub_organization: Optional = None push_to_hub_token: Optional = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: Optional = None ray_scope: Optional = 'last' ddp_timeout: Optional = 1800 torch_compile: bool = False torch_compile_backend: Optional = None torch_compile_mode: Optional = None dispatch_batches: Optional = None split_batches: Optional = None include_tokens_per_second: Optional = False include_num_input_tokens_seen: Optional = False neftune_noise_alpha: Optional = None optim_target_modules: Union = None batch_eval_metrics: bool = False eval_on_start: bool = False use_liger_kernel: Optional = False eval_use_gather_object: Optional = False use_habana: Optional = False gaudi_config_name: Optional = None use_lazy_mode: Optional = True use_hpu_graphs: Optional = False use_hpu_graphs_for_inference: Optional = False use_hpu_graphs_for_training: Optional = False use_compiled_autograd: Optional = False compile_dynamic: Optional = None disable_tensor_cache_hpu_graphs: Optional = False max_hpu_graphs: Optional = None distribution_strategy: Optional = 'ddp' throughput_warmup_steps: Optional = 0 adjust_throughput: bool = False pipelining_fwd_bwd: Optional = False ignore_eos: Optional = True non_blocking_data_copy: Optional = False profiling_warmup_steps: Optional = 0 profiling_steps: Optional = 0 profiling_record_shapes: Optional = True fp8: Optional = False )

Parameters

use_habana (bool, optional, defaults to False) — Whether to use Habana’s HPU for running the model.
gaudi_config_name (str, optional) — Pretrained Gaudi config name or path.
use_lazy_mode (bool, optional, defaults to True) — Whether to use lazy mode for running the model.
use_hpu_graphs (bool, optional, defaults to False) — Deprecated, use use_hpu_graphs_for_inference instead. Whether to use HPU graphs for performing inference.
use_hpu_graphs_for_inference (bool, optional, defaults to False) — Whether to use HPU graphs for performing inference. It will speed up latency but may not be compatible with some operations.
use_hpu_graphs_for_training (bool, optional, defaults to False) — Whether to use HPU graphs for performing inference. It will speed up training but may not be compatible with some operations.
use_compiled_autograd (bool, optional, defaults to False) — Whether to use compiled autograd for training. Currently only for summarization models.
compile_dynamic (bool|None, optional, defaults to None) — Set value of ‘dynamic’ parameter for torch.compile.
disable_tensor_cache_hpu_graphs (bool, optional, defaults to False) — Whether to disable tensor cache when using hpu graphs. If True, tensors won’t be cached in hpu graph and memory can be saved.
max_hpu_graphs (int, optional) — Maximum number of hpu graphs to be cached. Reduce to save device memory.
distribution_strategy (str, optional, defaults to ddp) — Determines how data parallel distributed training is achieved. May be: ddp or fast_ddp.
throughput_warmup_steps (int, optional, defaults to 0) — Number of steps to ignore for throughput calculation. For example, with throughput_warmup_steps=N, the first N steps will not be considered in the calculation of the throughput. This is especially useful in lazy mode where the first two or three iterations typically take longer.
adjust_throughput (‘bool’, optional, defaults to False) — Whether to remove the time taken for logging, evaluating and saving from throughput calculation.
pipelining_fwd_bwd (bool, optional, defaults to False) — Whether to add an additional mark_step between forward and backward for pipelining host backward building and HPU forward computing.
non_blocking_data_copy (bool, optional, defaults to False) — Whether to enable async data copy when preparing inputs.
profiling_warmup_steps (int, optional, defaults to 0) — Number of steps to ignore for profling.
profiling_steps (int, optional, defaults to 0) — Number of steps to be captured when enabling profiling.

GaudiTrainingArguments is built on top of the Tranformers’ TrainingArguments to enable deployment on Habana’s Gaudi.

GaudiSeq2SeqTrainingArguments

class optimum.habana.GaudiSeq2SeqTrainingArguments

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: Union = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: Optional = None per_gpu_eval_batch_size: Optional = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: Optional = None eval_delay: Optional = 0 torch_empty_cache_steps: Optional = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: Optional = 1e-06 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: Union = 'linear' lr_scheduler_kwargs: Union = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: Optional = 'passive' log_level_replica: Optional = 'warning' log_on_each_node: bool = True logging_dir: Optional = None logging_strategy: Union = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: Optional = False save_strategy: Union = 'steps' save_steps: float = 500 save_total_limit: Optional = None save_safetensors: Optional = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: Optional = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'hpu_amp' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: Optional = None local_rank: int = -1 ddp_backend: Optional = 'hccl' tpu_num_cores: Optional = None tpu_metrics_debug: bool = False debug: Union = '' dataloader_drop_last: bool = False eval_steps: Optional = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: Optional = None past_index: int = -1 run_name: Optional = None disable_tqdm: Optional = None remove_unused_columns: Optional = True label_names: Optional = None load_best_model_at_end: Optional = False metric_for_best_model: Optional = None greater_is_better: Optional = None ignore_data_skip: bool = False fsdp: Union = '' fsdp_min_num_params: int = 0 fsdp_config: Union = None fsdp_transformer_layer_cls_to_wrap: Optional = None accelerator_config: Union = None deepspeed: Union = None label_smoothing_factor: float = 0.0 optim: Union = 'adamw_torch' optim_args: Optional = None adafactor: bool = False group_by_length: bool = False length_column_name: Optional = 'length' report_to: Union = None ddp_find_unused_parameters: Optional = False ddp_bucket_cap_mb: Optional = 230 ddp_broadcast_buffers: Optional = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: Optional = None hub_model_id: Optional = None hub_strategy: Union = 'every_save' hub_token: Optional = None hub_private_repo: bool = False hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: Union = None include_inputs_for_metrics: bool = False eval_do_concat_batches: bool = True fp16_backend: str = 'auto' evaluation_strategy: Union = None push_to_hub_model_id: Optional = None push_to_hub_organization: Optional = None push_to_hub_token: Optional = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: Optional = None ray_scope: Optional = 'last' ddp_timeout: Optional = 1800 torch_compile: bool = False torch_compile_backend: Optional = None torch_compile_mode: Optional = None dispatch_batches: Optional = None split_batches: Optional = None include_tokens_per_second: Optional = False include_num_input_tokens_seen: Optional = False neftune_noise_alpha: Optional = None optim_target_modules: Union = None batch_eval_metrics: bool = False eval_on_start: bool = False use_liger_kernel: Optional = False eval_use_gather_object: Optional = False use_habana: Optional = False gaudi_config_name: Optional = None use_lazy_mode: Optional = True use_hpu_graphs: Optional = False use_hpu_graphs_for_inference: Optional = False use_hpu_graphs_for_training: Optional = False use_compiled_autograd: Optional = False compile_dynamic: Optional = None disable_tensor_cache_hpu_graphs: Optional = False max_hpu_graphs: Optional = None distribution_strategy: Optional = 'ddp' throughput_warmup_steps: Optional = 0 adjust_throughput: bool = False pipelining_fwd_bwd: Optional = False ignore_eos: Optional = True non_blocking_data_copy: Optional = False profiling_warmup_steps: Optional = 0 profiling_steps: Optional = 0 profiling_record_shapes: Optional = True fp8: Optional = False sortish_sampler: bool = False predict_with_generate: bool = False generation_max_length: Optional = None generation_num_beams: Optional = None generation_config: Union = None )

Parameters

sortish_sampler (bool, optional, defaults to False) — Whether to use a sortish sampler or not. Only possible if the underlying datasets are Seq2SeqDataset for now but will become generally available in the near future. It sorts the inputs according to lengths in order to minimize the padding size, with a bit of randomness for the training set.
predict_with_generate (bool, optional, defaults to False) — Whether to use generate to calculate generative metrics (ROUGE, BLEU).
generation_max_length (int, optional) — The max_length to use on each evaluation loop when predict_with_generate=True. Will default to the max_length value of the model configuration.
generation_num_beams (int, optional) — The num_beams to use on each evaluation loop when predict_with_generate=True. Will default to the num_beams value of the model configuration.
generation_config (str or Path or transformers.generation.GenerationConfig, optional) — Allows to load a transformers.generation.GenerationConfig from the from_pretrained method. This can be either:
- a string, the model id of a pretrained model configuration hosted inside a model repo on huggingface.co.
- a path to a directory containing a configuration file saved using the transformers.GenerationConfig.save_pretrained method, e.g., ./my_model_directory/.
- a transformers.generation.GenerationConfig object.

GaudiSeq2SeqTrainingArguments is built on top of the Tranformers’ Seq2SeqTrainingArguments to enable deployment on Habana’s Gaudi.

to_dict

( )

Serializes this instance while replace Enum by their values and GaudiGenerationConfig by dictionaries (for JSON serialization support). It obfuscates the token values by removing their value.

< > Update on GitHub

←What are Habana's Gaudi and HPUs? Gaudi Configuration→