Generic model classes


The NeuronTracedModel class is available for instantiating a base Neuron model without a specific head. It is used as the base class for all tasks but text generation.

class optimum.neuron.NeuronTracedModel

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )

Base class running compiled and optimized models on Neuron devices.

It implements generic methods for interacting with the Hugging Face Hub, as well as for compiling vanilla transformers models into Neuron-optimized TorchScript modules and exporting them with the optimum.exporters.neuron toolchain.

Class attributes:

  • model_type (str, optional, defaults to "neuron_model") — The name of the model type to use when registering the NeuronTracedModel classes.
  • auto_model_class (Type, optional, defaults to AutoModel) — The AutoModel class to be represented by the current NeuronTracedModel class.

Common attributes:

  • model (torch.jit._script.ScriptModule) — The loaded ScriptModule compiled for neuron devices.
  • config (PretrainedConfig) — The configuration of the model.
  • model_save_dir (Path) — The directory where a Neuron-compiled model is saved. By default, if the loaded model is local, the directory of the original model is used. Otherwise, the cache directory is used.


( )

Returns whether this model can generate sequences with .generate().


( neuron_config: NeuronDefaultConfig )

Gets a dictionary of inputs with their valid static shapes.


( path: Union to_neuron: bool = False device_id: int = 0 )


  • path (Union[str, Path]) — Path of the compiled model.
  • to_neuron (bool, defaults to False) — Whether to manually move the traced model to a NeuronCore. This is only needed when inline_weights_to_neff=False; otherwise the model is loaded onto a Neuron device automatically.
  • device_id (int, defaults to 0) — Index of NeuronCore to load the traced model to.

Loads a TorchScript module compiled by the neuron(x)-cc compiler. It is first loaded onto the CPU and then moved to one or more NeuronCores.


( outputs: List dims: List indices: List padding_side: Literal = 'right' )


  • outputs (List[torch.Tensor]) — List of torch tensors which are inference output.
  • dims (List[int]) — List of dimensions in which we slice a tensor.
  • indices (List[int]) — List of indices in which we slice a tensor along an axis.
  • padding_side (Literal["right", "left"], defaults to “right”) — The side on which the padding has been applied.

Removes padding from output tensors.
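
Padding matters here because compiled models expect inputs of a fixed (static) shape: shorter inputs are padded up to that shape, and the padded positions are sliced off again afterwards. The sketch below illustrates the idea on plain Python lists. `pad_to_length` and `strip_padding` are hypothetical helpers written for this illustration, not functions of optimum.neuron, and the real method works on torch tensors along the given dims and indices.

```python
from typing import List


def pad_to_length(values: List[int], length: int, pad_value: int = 0,
                  padding_side: str = "right") -> List[int]:
    """Pad a sequence up to a fixed (static) length."""
    if len(values) > length:
        raise ValueError("input is longer than the static length")
    padding = [pad_value] * (length - len(values))
    return values + padding if padding_side == "right" else padding + values


def strip_padding(values: List[int], original_length: int,
                  padding_side: str = "right") -> List[int]:
    """Slice the padded positions back off a sequence."""
    if padding_side == "right":
        return values[:original_length]
    return values[len(values) - original_length:]


tokens = [101, 7592, 2088, 102]          # toy token ids
padded = pad_to_length(tokens, length=8)  # static sequence length of 8
restored = strip_padding(padded, original_length=len(tokens))
```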


The NeuronDecoderModel class is the base class for text generation models.

class optimum.neuron.NeuronDecoderModel

( config: PretrainedConfig checkpoint_dir: Union compiled_dir: Union = None generation_config: Optional = None )

Base class to convert and run pre-trained transformers decoder models on Neuron devices.

It implements the methods to convert a pre-trained transformers decoder model into a Neuron transformer model by:

  • transferring the checkpoint weights of the original into an optimized neuron graph,
  • compiling the resulting graph using the Neuron compiler.

Common attributes:

  • model (torch.nn.Module) — The decoder model with a graph optimized for neuron devices.
  • config (PretrainedConfig) — The configuration of the original model.
  • generation_config (GenerationConfig) — The generation configuration used by default when calling generate().

Natural Language Processing

The following Neuron model classes are available for natural language processing tasks.


class optimum.neuron.NeuronModelForFeatureExtraction

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — The TorchScript module with embedded NEFF (Neuron Executable File Format), compiled by the neuron(x) compiler.

Neuron Model with a BaseModelOutput for feature-extraction tasks.

This model inherits from optimum.neuron.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Feature Extraction model on Neuron devices.


( input_ids: Tensor attention_mask: Tensor token_type_ids: Optional = None **kwargs )


  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The NeuronModelForFeatureExtraction forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of feature extraction: (Following model is compiled with neuronx compiler and can only be run on INF2. Replace “neuronx” with “neuron” if you are using INF1.)

>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForFeatureExtraction

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/all-MiniLM-L6-v2-neuronx")
>>> model = NeuronModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2-neuronx")

>>> inputs = tokenizer("Dear Evan Hansen is the winner of six Tony Awards.", return_tensors="pt")

>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> list(last_hidden_state.shape)
[1, 13, 384]
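
A common way to reduce `last_hidden_state` to one vector per sentence is masked mean pooling. The following is a minimal sketch on nested Python lists with made-up values; `mean_pool` is a hypothetical helper, not part of optimum.neuron, and the pooling strategy is an assumption rather than something this model class prescribes.

```python
from typing import List


def mean_pool(hidden_states: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, ignoring positions where the mask is 0."""
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy sequence: three tokens with hidden size 2; the last token is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = mean_pool(hidden, mask)
```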


class optimum.neuron.NeuronModelForSentenceTransformers

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — The TorchScript module with embedded NEFF (Neuron Executable File Format), compiled by the neuron(x) compiler.

Neuron Model for Sentence Transformers.

This model inherits from optimum.neuron.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Sentence Transformers model on Neuron devices.


( input_ids: Tensor attention_mask: Tensor pixel_values: Optional = None token_type_ids: Optional = None **kwargs )


  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The NeuronModelForSentenceTransformers forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of TEXT Sentence Transformers:

>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForSentenceTransformers

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bge-base-en-v1.5-neuronx")
>>> model = NeuronModelForSentenceTransformers.from_pretrained("optimum/bge-base-en-v1.5-neuronx")

>>> inputs = tokenizer("In the smouldering promise of the fall of Troy, a mythical world of gods and mortals rises from the ashes.", return_tensors="pt")

>>> outputs = model(**inputs)
>>> token_embeddings = outputs.token_embeddings
>>> sentence_embedding = outputs.sentence_embedding
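
Sentence embeddings are usually compared with cosine similarity. A minimal sketch, assuming the embeddings have been converted to plain Python lists (for example with `.tolist()`); `cosine_similarity` is a hypothetical helper and the vectors are made-up values.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
```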


class optimum.neuron.NeuronModelForMaskedLM

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — The TorchScript module with embedded NEFF (Neuron Executable File Format), compiled by the neuron(x) compiler.

Neuron Model with a MaskedLMOutput for masked language modeling tasks.

This model inherits from optimum.neuron.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Masked language model on Neuron devices.


( input_ids: Tensor attention_mask: Tensor token_type_ids: Optional = None **kwargs )


  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The NeuronModelForMaskedLM forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of fill mask: (Following model is compiled with neuronx compiler and can only be run on INF2. Replace “neuronx” with “neuron” if you are using INF1.)

>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForMaskedLM
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/legal-bert-base-uncased-neuronx")
>>> model = NeuronModelForMaskedLM.from_pretrained("optimum/legal-bert-base-uncased-neuronx")

>>> inputs = tokenizer("This [MASK] Agreement is between General Motors and John Murray.", return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
[1, 13, 30522]
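
To turn these logits into predictions, one typically looks at the row for the masked position and keeps the highest-scoring vocabulary entries. The sketch below does this for a toy four-word vocabulary; `top_k_indices`, the vocabulary and the scores are all made up for illustration.

```python
from typing import List


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k largest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy logits for one masked position over a four-word vocabulary.
vocab = ["lease", "license", "purchase", "the"]
mask_logits = [2.5, 3.1, 0.4, -1.0]
candidates = [vocab[i] for i in top_k_indices(mask_logits, k=2)]
```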


class optimum.neuron.NeuronModelForSequenceClassification

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — The TorchScript module with embedded NEFF (Neuron Executable File Format), compiled by the neuron(x) compiler.

Neuron Model with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.

This model inherits from optimum.neuron.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Sequence Classification model on Neuron devices.


( input_ids: Tensor attention_mask: Tensor token_type_ids: Optional = None **kwargs )


  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The NeuronModelForSequenceClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of single-label classification: (Following model is compiled with neuronx compiler and can only be run on INF2.)

>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english-neuronx")
>>> model = NeuronModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english-neuronx")

>>> inputs = tokenizer("Hamilton is considered to be the best musical of human history.", return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
[1, 2]
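
The two logits can be converted into class probabilities with a softmax, and the predicted class is the index with the highest probability. A minimal sketch in plain Python; the logits and the `id2label` mapping are made-up assumptions (a real model provides its own mapping in its configuration).

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one sentence and a hypothetical label mapping.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probs = softmax([-2.0, 3.0])
predicted = id2label[max(range(len(probs)), key=probs.__getitem__)]
```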


class optimum.neuron.NeuronModelForQuestionAnswering

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — The TorchScript module with embedded NEFF (Neuron Executable File Format), compiled by the neuron(x) compiler.

Neuron Model with a QuestionAnsweringModelOutput for extractive question-answering tasks like SQuAD.

This model inherits from optimum.neuron.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Question Answering model on Neuron devices.


( input_ids: Tensor attention_mask: Tensor token_type_ids: Optional = None **kwargs )


  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The NeuronModelForQuestionAnswering forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of question answering: (Following model is compiled with neuronx compiler and can only be run on INF2.)

>>> import torch
>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForQuestionAnswering

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2-neuronx")
>>> model = NeuronModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2-neuronx")

>>> question, text = "Are there wheelchair spaces in the theatres?", "Yes, we have reserved wheelchair spaces with a good view."
>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([12])

>>> outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits
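
The start and end scores are usually decoded into an answer span by searching for the pair with the highest combined score, subject to the start not coming after the end. The sketch below shows one simple way to do that on plain lists; `best_span`, the `max_answer_len` limit and the scores are illustrative assumptions, not the library's decoding logic.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 15) -> Tuple[int, int]:
    """Return (start, end) maximizing start + end score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Taking each argmax independently would give start=1, end=0, which is invalid.
start = [0.1, 4.0, 0.2, 0.3]
end = [5.0, 0.1, 0.2, 3.5]
span = best_span(start, end)
```

The token indices in `span` can then be mapped back to text by decoding the corresponding slice of `input_ids` with the tokenizer.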


class optimum.neuron.NeuronModelForTokenClassification

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — The TorchScript module with embedded NEFF (Neuron Executable File Format), compiled by the neuron(x) compiler.

Neuron Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

This model inherits from optimum.neuron.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Token Classification model on Neuron devices.


( input_ids: Tensor attention_mask: Tensor token_type_ids: Optional = None **kwargs )


  • input_ids (torch.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (Union[torch.Tensor, None] of shape (batch_size, sequence_length), defaults to None) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The NeuronModelForTokenClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of token classification: (Following model is compiled with neuronx compiler and can only be run on INF2.)

>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForTokenClassification

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-NER-neuronx")
>>> model = NeuronModelForTokenClassification.from_pretrained("optimum/bert-base-NER-neuronx")

>>> inputs = tokenizer("Lin-Manuel Miranda is an American songwriter, actor, singer, filmmaker, and playwright.", return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
[1, 20, 9]
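
The last dimension of the logits holds one score per label, so each token is tagged with the label of its highest score. A small sketch on nested lists; the three-entry `id2label` mapping and the scores are hypothetical and much smaller than the nine labels of the model above.

```python
from typing import List

# Hypothetical label set for illustration; a real model defines its own mapping.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}


def label_tokens(token_logits: List[List[float]]) -> List[str]:
    """Pick the highest-scoring label for every token."""
    labels = []
    for scores in token_logits:
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(id2label[best])
    return labels


# Made-up logits for three tokens over the three labels above.
toy_logits = [[0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [3.0, 0.0, 0.1]]
tags = label_tokens(toy_logits)
```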


class optimum.neuron.NeuronModelForMultipleChoice

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — The TorchScript module with embedded NEFF (Neuron Executable File Format), compiled by the neuron(x) compiler.

Neuron Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax) e.g. for RocStories/SWAG tasks.

This model inherits from optimum.neuron.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Multiple choice model on Neuron devices.


( input_ids: Tensor attention_mask: Tensor token_type_ids: Optional = None **kwargs )


  • input_ids (torch.Tensor of shape (batch_size, num_choices, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode and PreTrainedTokenizer.__call__ for details. What are input IDs?
  • attention_mask (Union[torch.Tensor, None] of shape (batch_size, num_choices, sequence_length), defaults to None) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  • token_type_ids (Union[torch.Tensor, None] of shape (batch_size, num_choices, sequence_length), defaults to None) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

The NeuronModelForMultipleChoice forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of multiple choice: (Following model is compiled with neuronx compiler and can only be run on INF2.)

>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForMultipleChoice

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/bert-base-uncased_SWAG-neuronx")
>>> model = NeuronModelForMultipleChoice.from_pretrained("optimum/bert-base-uncased_SWAG-neuronx")

>>> num_choices = 4
>>> first_sentence = ["Members of the procession walk down the street holding small horn brass instruments."] * num_choices
>>> second_sentence = [
...     "A drum line passes by walking down the street playing their instruments.",
...     "A drum line has heard approaching them.",
...     "A drum line arrives and they're outside dancing and asleep.",
...     "A drum line turns the lead singer watches the performance."
... ]
>>> inputs = tokenizer(first_sentence, second_sentence, truncation=True, padding=True)

>>> # Unflatten the input values, expanding them to the shape [batch_size, num_choices, seq_length]
>>> for k, v in inputs.items():
...     inputs[k] = [v[i: i + num_choices] for i in range(0, len(v), num_choices)]
>>> inputs = dict(inputs.convert_to_tensors(tensor_type="pt"))
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
[1, 4]
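
The unflattening step in this example can be checked on plain lists. The sketch below reuses the same list comprehension with toy values to show how a flat list of `batch_size * num_choices` encoded sequences becomes a nested `[batch_size, num_choices, seq_length]` structure; the token ids are made up.

```python
num_choices = 4

# Two questions with four choices each, already tokenized to length 3 (toy ids).
flat = [[q, c, 0] for q in (1, 2) for c in range(num_choices)]

# Group every num_choices consecutive sequences under one batch entry.
nested = [flat[i: i + num_choices] for i in range(0, len(flat), num_choices)]

# (batch_size, num_choices, seq_length)
shape = (len(nested), len(nested[0]), len(nested[0][0]))
```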


class optimum.neuron.NeuronModelForCausalLM

( config: PretrainedConfig checkpoint_dir: Union compiled_dir: Union = None generation_config: Optional = None )


  • model (torch.nn.Module) — torch.nn.Module is the neuron decoder graph.
  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model.
  • model_path (Path) — The directory where the compiled artifacts for the model are stored. It can be a temporary directory if the model has never been saved locally before.
  • generation_config (transformers.GenerationConfig) — GenerationConfig holds the configuration for the model generation task.

Neuron model with a causal language modeling head for inference on Neuron devices.

This model inherits from optimum.neuron.NeuronDecoderModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).


( input_ids: Tensor cache_ids: Tensor start_ids: Tensor = None return_dict: bool = True )


  • input_ids (torch.LongTensor) — Indices of decoder input sequence tokens in the vocabulary of shape (batch_size, sequence_length).
  • cache_ids (torch.LongTensor) — The indices at which the cached key and value for the current inputs need to be stored.
  • start_ids (torch.LongTensor) — The indices of the first tokens to be processed, deduced from the attention masks.

The NeuronModelForCausalLM forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of text generation:

>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForCausalLM
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = NeuronModelForCausalLM.from_pretrained("gpt2", export=True)

>>> inputs = tokenizer("My favorite moment of the day is", return_tensors="pt")

>>> gen_tokens = model.generate(**inputs, do_sample=True, temperature=0.9, min_length=20, max_length=20)
>>> tokenizer.batch_decode(gen_tokens)
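
With `do_sample=True`, the temperature rescales the logits before a token is drawn: values below 1 sharpen the distribution and values above 1 flatten it. The following plain-Python sketch illustrates that effect on made-up logits; `temperature_probs` and `sample_next_token` are hypothetical helpers and not how `generate()` is implemented.

```python
import math
import random
from typing import List


def temperature_probs(logits: List[float], temperature: float) -> List[float]:
    """Softmax of the logits divided by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    shift = max(scaled)
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]
sharp = temperature_probs(toy_logits, temperature=0.5)  # more peaked
flat = temperature_probs(toy_logits, temperature=2.0)   # closer to uniform
token = sample_next_token(toy_logits, temperature=0.9, rng=random.Random(0))
```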

Computer Vision

The following Neuron model classes are available for computer vision tasks.


class optimum.neuron.NeuronModelForImageClassification

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — The TorchScript module with embedded NEFF (Neuron Executable File Format), compiled by the neuron(x) compiler.

Neuron Model with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet.

This model inherits from optimum.neuron.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Neuron Model for image-classification tasks. This class officially supports beit, convnext, convnextv2, deit, levit, mobilenet_v2, mobilevit, vit, etc.


( pixel_values: Tensor **kwargs )


  • pixel_values (Union[torch.Tensor, None] of shape (batch_size, num_channels, height, width), defaults to None) — Pixel values corresponding to the images in the current batch. Pixel values can be obtained from encoded images using AutoImageProcessor.

The NeuronModelForImageClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of image classification:

>>> import requests
>>> from PIL import Image
>>> from optimum.neuron import NeuronModelForImageClassification
>>> from transformers import AutoImageProcessor

>>> url = ""
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> preprocessor = AutoImageProcessor.from_pretrained("optimum/vit-base-patch16-224-neuronx")
>>> model = NeuronModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224-neuronx")

>>> inputs = preprocessor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> predicted_label = logits.argmax(-1).item()

Example using transformers.pipeline:

>>> import requests
>>> from PIL import Image
>>> from transformers import AutoImageProcessor, pipeline
>>> from optimum.neuron import NeuronModelForImageClassification

>>> preprocessor = AutoImageProcessor.from_pretrained("optimum/vit-base-patch16-224-neuronx")
>>> model = NeuronModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224-neuronx")
>>> pipe = pipeline("image-classification", model=model, feature_extractor=preprocessor)

>>> url = ""
>>> pred = pipe(url)


class optimum.neuron.NeuronModelForSemanticSegmentation

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — The TorchScript module with embedded NEFF (Neuron Executable File Format), compiled by the neuron(x) compiler.

Neuron Model with a semantic segmentation head on top, e.g. for Pascal VOC.

This model inherits from optimum.neuron.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Neuron Model for semantic-segmentation, with an all-MLP decode head on top e.g. for ADE20k, CityScapes. This class officially supports mobilevit, mobilenet-v2, etc.


( pixel_values: Tensor **kwargs )


  • pixel_values (Union[torch.Tensor, None] of shape (batch_size, num_channels, height, width), defaults to None) — Pixel values corresponding to the images in the current batch. Pixel values can be obtained from encoded images using AutoImageProcessor.

The NeuronModelForSemanticSegmentation forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of semantic segmentation:

>>> import requests
>>> from PIL import Image
>>> from optimum.neuron import NeuronModelForSemanticSegmentation
>>> from transformers import AutoImageProcessor

>>> url = ""
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> preprocessor = AutoImageProcessor.from_pretrained("optimum/deeplabv3-mobilevit-small-neuronx")
>>> model = NeuronModelForSemanticSegmentation.from_pretrained("optimum/deeplabv3-mobilevit-small-neuronx")

>>> inputs = preprocessor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs.logits
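
The logits carry one channel per class, so a per-pixel label map is usually obtained by taking the argmax over the class dimension (upsampling to the input resolution is omitted here). The following is a minimal pure-Python sketch on a toy nested list, not the library's own post-processing; the helper name `argmax_over_classes` is hypothetical, and with real tensors something like `logits.argmax(dim=1)` would be the more natural choice.

```python
def argmax_over_classes(logits):
    """Return a (height, width) map holding the best-scoring class per pixel.

    `logits` is a nested list of shape (num_classes, height, width)
    for a single image (hypothetical toy layout).
    """
    num_classes = len(logits)
    height = len(logits[0])
    width = len(logits[0][0])
    return [
        [max(range(num_classes), key=lambda c: logits[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Toy logits: 3 classes on a 2x2 image.
toy_logits = [
    [[2.0, 0.1], [0.3, 0.2]],  # class 0
    [[0.5, 1.5], [0.1, 0.9]],  # class 1
    [[0.1, 0.2], [1.2, 0.4]],  # class 2
]
segmentation_map = argmax_over_classes(toy_logits)
print(segmentation_map)
```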

Example using transformers.pipeline:

>>> import requests
>>> from PIL import Image
>>> from transformers import AutoImageProcessor, pipeline
>>> from optimum.neuron import NeuronModelForSemanticSegmentation

>>> preprocessor = AutoImageProcessor.from_pretrained("optimum/deeplabv3-mobilevit-small-neuronx")
>>> model = NeuronModelForSemanticSegmentation.from_pretrained("optimum/deeplabv3-mobilevit-small-neuronx")
>>> pipe = pipeline("image-segmentation", model=model, feature_extractor=preprocessor)

>>> url = ""
>>> pred = pipe(url)


class optimum.neuron.NeuronModelForObjectDetection

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — torch.jit._script.ScriptModule is the TorchScript module with embedded NEFF (Neuron Executable File Format) compiled by the neuron(x) compiler.

Neuron Model with object detection heads on top, for tasks such as COCO detection.

This model inherits from optimum.neuron.modeling.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Neuron Model for object-detection, with object detection heads on top, for tasks such as COCO detection.


forward

( pixel_values: Tensor **kwargs )


  • pixel_values (Union[torch.Tensor, None] of shape (batch_size, num_channels, height, width), defaults to None) — Pixel values corresponding to the images in the current batch. Pixel values can be obtained from encoded images using AutoImageProcessor.

The NeuronModelForObjectDetection forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of object detection:

>>> import requests
>>> import torch
>>> from PIL import Image
>>> from optimum.neuron import NeuronModelForObjectDetection
>>> from transformers import AutoImageProcessor

>>> url = ""
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> preprocessor = AutoImageProcessor.from_pretrained("hustvl/yolos-tiny")
>>> model = NeuronModelForObjectDetection.from_pretrained("hustvl/yolos-tiny", export=True, batch_size=1)

>>> inputs = preprocessor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> target_sizes = torch.tensor([image.size[::-1]])
>>> results = preprocessor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]
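
The post-processing step above keeps detections whose score exceeds the threshold and rescales the boxes to the original image. The sketch below illustrates that idea in pure Python, assuming the model emits normalized (center_x, center_y, width, height) boxes as DETR-style detectors typically do. The helper names are hypothetical and this is not the transformers implementation. It also shows why the example reverses `image.size`: PIL reports (width, height), while target sizes are given as (height, width).

```python
def center_to_corners(box, image_height, image_width):
    """Convert a normalized (cx, cy, w, h) box to absolute (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    )


def filter_detections(scores, labels, boxes, target_size, threshold=0.9):
    """Keep detections scoring above `threshold` and rescale their boxes."""
    image_height, image_width = target_size
    kept = []
    for score, label, box in zip(scores, labels, boxes):
        if score > threshold:
            kept.append(
                {
                    "score": score,
                    "label": label,
                    "box": center_to_corners(box, image_height, image_width),
                }
            )
    return kept


pil_size = (640, 480)         # PIL's Image.size is (width, height)
target_size = pil_size[::-1]  # reversed to (height, width)

detections = filter_detections(
    scores=[0.98, 0.40],      # toy values, not real model output
    labels=[17, 3],
    boxes=[(0.5, 0.5, 0.25, 0.5), (0.1, 0.1, 0.1, 0.1)],
    target_size=target_size,
)
print(detections)
```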

Example using transformers.pipeline:

>>> import requests
>>> from PIL import Image
>>> from transformers import AutoImageProcessor, pipeline
>>> from optimum.neuron import NeuronModelForObjectDetection

>>> preprocessor = AutoImageProcessor.from_pretrained("hustvl/yolos-tiny")
>>> model = NeuronModelForObjectDetection.from_pretrained("hustvl/yolos-tiny", export=True, batch_size=1)
>>> pipe = pipeline("object-detection", model=model, feature_extractor=preprocessor)

>>> url = ""
>>> pred = pipe(url)


The following auto classes are available for audio tasks.


class optimum.neuron.NeuronModelForAudioClassification

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — torch.jit._script.ScriptModule is the TorchScript module with embedded NEFF (Neuron Executable File Format) compiled by the neuron(x) compiler.

Neuron Model with an audio classification head.

This model inherits from optimum.neuron.modeling.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Neuron Model for audio-classification, with a sequence classification head on top (a linear layer over the pooled output) for tasks like SUPERB Keyword Spotting.


forward

( input_values: Tensor **kwargs )


  • input_values (torch.Tensor of shape (batch_size, sequence_length)) — Float values of the input raw speech waveform. Input values can be obtained from an audio file loaded into an array using AutoProcessor.

The NeuronModelForAudioClassification forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of audio classification:

>>> from transformers import AutoProcessor
>>> from optimum.neuron import NeuronModelForAudioClassification
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoProcessor.from_pretrained("Jingya/wav2vec2-large-960h-lv60-self-neuronx-audio-classification")
>>> model = NeuronModelForAudioClassification.from_pretrained("Jingya/wav2vec2-large-960h-lv60-self-neuronx-audio-classification")

>>> # audio file is decoded on the fly
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

>>> logits = model(**inputs).logits
>>> predicted_class_ids = torch.argmax(logits, dim=-1).item()
>>> predicted_label = model.config.id2label[predicted_class_ids]
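
The last three lines above map raw logits to a human-readable label. A minimal pure-Python sketch of the same step follows, with a softmax added to obtain probabilities; the `id2label` mapping and the logits are made-up toy values, not outputs of the model above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label mapping and logits for a single audio clip.
id2label = {0: "yes", 1: "no", 2: "up"}
toy_logits = [0.2, 2.5, -1.0]

probs = softmax(toy_logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
predicted_label = id2label[predicted_id]
print(predicted_label, round(probs[predicted_id], 3))
```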

Example using transformers.pipeline:

>>> from transformers import AutoProcessor, pipeline
>>> from optimum.neuron import NeuronModelForAudioClassification
>>> from datasets import load_dataset

>>> feature_extractor = AutoProcessor.from_pretrained("Jingya/wav2vec2-large-960h-lv60-self-neuronx-audio-classification")
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")

>>> model = NeuronModelForAudioClassification.from_pretrained("Jingya/wav2vec2-large-960h-lv60-self-neuronx-audio-classification")
>>> ac = pipeline("audio-classification", model=model, feature_extractor=feature_extractor)

>>> pred = ac(dataset[0]["audio"]["array"])


class optimum.neuron.NeuronModelForAudioFrameClassification

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — torch.jit._script.ScriptModule is the TorchScript module with embedded NEFF (Neuron Executable File Format) compiled by the neuron(x) compiler.

Neuron Model with an audio frame classification head.

This model inherits from optimum.neuron.modeling.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Neuron Model with a frame classification head on top for tasks like Speaker Diarization.


forward

( input_values: Tensor **kwargs )


  • input_values (torch.Tensor of shape (batch_size, sequence_length)) — Float values of the input raw speech waveform. Input values can be obtained from an audio file loaded into an array using AutoProcessor.

The NeuronModelForAudioFrameClassification forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of audio frame classification:

>>> from transformers import AutoProcessor
>>> from optimum.neuron import NeuronModelForAudioFrameClassification
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoProcessor.from_pretrained("Jingya/wav2vec2-base-superb-sd-neuronx")
>>> model = NeuronModelForAudioFrameClassification.from_pretrained("Jingya/wav2vec2-base-superb-sd-neuronx")

>>> inputs = feature_extractor(dataset[0]["audio"]["array"], return_tensors="pt", sampling_rate=sampling_rate)
>>> logits = model(**inputs).logits

>>> probabilities = torch.sigmoid(logits[0])
>>> labels = (probabilities > 0.5).long()
>>> labels[0].tolist()
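
The example applies a sigmoid to each logit rather than a softmax across them, which treats every speaker as an independent yes/no decision per frame, so several speakers can be active at once. A minimal pure-Python sketch of that thresholding, on made-up logits and with a hypothetical helper name `frame_labels`:

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def frame_labels(frame_logits, threshold=0.5):
    """Return one 0/1 flag per speaker for every frame.

    `frame_logits` is a list of frames, each a list with one logit per speaker.
    """
    return [
        [1 if sigmoid(value) > threshold else 0 for value in frame]
        for frame in frame_logits
    ]


# Three frames, two speakers (toy values).
toy_frame_logits = [[2.0, -3.0], [0.5, 0.7], [-1.0, -0.2]]
active = frame_labels(toy_frame_logits)
print(active)
```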


class optimum.neuron.NeuronModelForCTC

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — torch.jit._script.ScriptModule is the TorchScript module with embedded NEFF (Neuron Executable File Format) compiled by the neuron(x) compiler.

Neuron Model with a connectionist temporal classification head.

This model inherits from optimum.neuron.modeling.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

Neuron Model with a language modeling head on top for Connectionist Temporal Classification (CTC).


forward

( input_values: Tensor **kwargs )


  • input_values (torch.Tensor of shape (batch_size, sequence_length)) — Float values of the input raw speech waveform. Input values can be obtained from an audio file loaded into an array using AutoProcessor.

The NeuronModelForCTC forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of CTC:

>>> from transformers import AutoProcessor
>>> from optimum.neuron import NeuronModelForCTC
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = AutoProcessor.from_pretrained("Jingya/wav2vec2-large-960h-lv60-self-neuronx-ctc")
>>> model = NeuronModelForCTC.from_pretrained("Jingya/wav2vec2-large-960h-lv60-self-neuronx-ctc")

>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)

>>> transcription = processor.batch_decode(predicted_ids)
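
The argmax above yields one token id per audio frame, so the sequence still contains repeats and blank tokens. Greedy CTC decoding collapses consecutive duplicates and then drops the blanks. The following is a minimal pure-Python sketch of that rule, not the tokenizer's actual implementation; the vocabulary and the blank id of 0 are assumptions made for illustration.

```python
def ctc_greedy_collapse(token_ids, blank_id=0):
    """Collapse consecutive repeats, then drop blank tokens (greedy CTC rule)."""
    collapsed = []
    previous = None
    for token in token_ids:
        # Emit a token only when it starts a new run and is not the blank.
        if token != previous and token != blank_id:
            collapsed.append(token)
        previous = token
    return collapsed


# Hypothetical character vocabulary; id 0 is assumed to be the CTC blank.
vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]

decoded = "".join(vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(decoded)
```

The blank between the two runs of id 3 is what allows the doubled letter to survive the collapse.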

Example using transformers.pipeline:

>>> from transformers import AutoProcessor, pipeline
>>> from optimum.neuron import NeuronModelForCTC
>>> from datasets import load_dataset

>>> processor = AutoProcessor.from_pretrained("Jingya/wav2vec2-large-960h-lv60-self-neuronx-ctc")
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")

>>> model = NeuronModelForCTC.from_pretrained("Jingya/wav2vec2-large-960h-lv60-self-neuronx-ctc")
>>> asr = pipeline("automatic-speech-recognition", model=model, feature_extractor=processor.feature_extractor, tokenizer=processor.tokenizer)


class optimum.neuron.NeuronModelForXVector

( model: ScriptModule config: PretrainedConfig model_save_dir: Union = None model_file_name: Optional = None preprocessors: Optional = None neuron_config: Optional = None **kwargs )


  • config (transformers.PretrainedConfig) — PretrainedConfig is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the optimum.neuron.modeling.NeuronTracedModel.from_pretrained method to load the model weights.
  • model (torch.jit._script.ScriptModule) — torch.jit._script.ScriptModule is the TorchScript module with embedded NEFF (Neuron Executable File Format) compiled by the neuron(x) compiler.

Neuron Model with an XVector feature extraction head on top for tasks like Speaker Verification.

This model inherits from optimum.neuron.modeling.NeuronTracedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).


forward

( input_values: Tensor **kwargs )


  • input_values (torch.Tensor of shape (batch_size, sequence_length)) — Float values of the input raw speech waveform. Input values can be obtained from an audio file loaded into an array using AutoProcessor.

The NeuronModelForXVector forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of Audio XVector:

>>> from transformers import AutoProcessor
>>> from optimum.neuron import NeuronModelForXVector
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoProcessor.from_pretrained("Jingya/wav2vec2-base-superb-sv-neuronx")
>>> model = NeuronModelForXVector.from_pretrained("Jingya/wav2vec2-base-superb-sv-neuronx")

>>> inputs = feature_extractor(
...     [d["array"] for d in dataset[:2]["audio"]], sampling_rate=sampling_rate, return_tensors="pt", padding=True
... )
>>> embeddings = model(**inputs).embeddings

>>> embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

>>> cosine_sim = torch.nn.CosineSimilarity(dim=-1)
>>> similarity = cosine_sim(embeddings[0], embeddings[1])
>>> threshold = 0.7
>>> if similarity < threshold:
...     print("Speakers are not the same!")
>>> round(similarity.item(), 2)
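
Speaker verification in the example comes down to comparing two embeddings with cosine similarity and checking the result against a threshold. A minimal pure-Python sketch of that comparison, using toy two-dimensional vectors in place of real x-vectors and a hypothetical helper `same_speaker`:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.7):
    """Treat two embeddings as the same speaker when similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


emb_a = [1.0, 0.0]
emb_b = [0.0, 1.0]  # orthogonal to emb_a
emb_c = [2.0, 0.0]  # same direction as emb_a, different length

print(same_speaker(emb_a, emb_b), same_speaker(emb_a, emb_c))
```

Cosine similarity ignores vector length, so normalizing the embeddings first, as the example does, leaves the score unchanged.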

Stable Diffusion

The following Neuron model classes are available for stable diffusion tasks.


class optimum.neuron.NeuronStableDiffusionPipeline

( text_encoder: ScriptModule unet: ScriptModule vae_decoder: Union config: Dict configs: Dict neuron_configs: Dict tokenizer: CLIPTokenizer scheduler: Union data_parallel_mode: Literal vae_encoder: Union = None text_encoder_2: Union = None tokenizer_2: Optional = None feature_extractor: Optional = None controlnet: Union = None model_save_dir: Union = None model_and_config_save_paths: Optional = None )


__call__

( prompt: Union = None num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: Union = None num_images_per_prompt: int = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None output_type: Optional = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None guidance_rescale: float = 0.0 ) diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple


  • prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
  • num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • guidance_scale (float, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
  • negative_prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide what not to include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt. If it differs from the batch size used for compilation, it is overridden by the static batch size of the compiled Neuron model (except with dynamic batching).
  • eta (float, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the diffusers.schedulers.DDIMScheduler, and is ignored in other schedulers.
  • generator (Optional[Union[torch.Generator, List[torch.Generator]]], defaults to None) — A torch.Generator to make generation deterministic.
  • latents (Optional[torch.FloatTensor], defaults to None) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
  • negative_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
  • output_type (Optional[str], defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput instead of a plain tuple.
  • callback (Optional[Callable], defaults to None) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
  • callback_steps (int, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
  • cross_attention_kwargs (dict, defaults to None) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.
  • guidance_rescale (float, defaults to 0.0) — Guidance rescale factor from Common Diffusion Noise Schedules and Sample Steps are Flawed. Guidance rescale factor should fix overexposure when using zero terminal SNR.


diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple

If return_dict is True, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.


>>> from optimum.neuron import NeuronStableDiffusionPipeline

>>> compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
>>> input_shapes = {"batch_size": 1, "height": 512, "width": 512}

>>> stable_diffusion = NeuronStableDiffusionPipeline.from_pretrained(
...     "runwayml/stable-diffusion-v1-5", export=True, **compiler_args, **input_shapes
... )
>>> stable_diffusion.save_pretrained("sd_neuron/")

>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> image = stable_diffusion(prompt).images[0]
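
The guidance_scale parameter described above is usually applied through classifier-free guidance: at each denoising step, the unconditional noise prediction is pushed toward the text-conditioned one. The sketch below shows that common formulation on flat Python lists. It is an illustration under that assumption, not code taken from the pipeline, and `apply_guidance` is a hypothetical name.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: uncond + scale * (text - uncond), element-wise."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# Toy noise predictions for two latent values.
noise_uncond = [0.0, 1.0]
noise_text = [1.0, 3.0]

guided = apply_guidance(noise_uncond, noise_text, guidance_scale=7.5)
unguided = apply_guidance(noise_uncond, noise_text, guidance_scale=1.0)
print(guided, unguided)
```

With a scale of 1 the result equals the text-conditioned prediction, which is consistent with guidance only taking effect when guidance_scale > 1.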


class optimum.neuron.NeuronStableDiffusionImg2ImgPipeline

( text_encoder: ScriptModule unet: ScriptModule vae_decoder: Union config: Dict configs: Dict neuron_configs: Dict tokenizer: CLIPTokenizer scheduler: Union data_parallel_mode: Literal vae_encoder: Union = None text_encoder_2: Union = None tokenizer_2: Optional = None feature_extractor: Optional = None controlnet: Union = None model_save_dir: Union = None model_and_config_save_paths: Optional = None )


__call__

( prompt: Union = None image: Optional = None strength: float = 0.8 num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: Union = None num_images_per_prompt: int = 1 eta: float = 0.0 generator: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None output_type: str = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None ) diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple


  • prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
  • image (Optional["PipelineImageInput"], defaults to None) — Image, numpy array or tensor representing an image batch to be used as the starting point. For both numpy arrays and PyTorch tensors, the expected value range is [0, 1]. If it’s a tensor or a list of tensors, the expected shape should be (B, C, H, W) or (C, H, W). If it is a numpy array or a list of arrays, the expected shape should be (B, H, W, C) or (H, W, C). It can also accept image latents as image, but latents passed directly are not encoded again.
  • strength (float, defaults to 0.8) — Indicates extent to transform the reference image. Must be between 0 and 1. image is used as a starting point and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, added noise is maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores image.
  • num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. This parameter is modulated by strength.
  • guidance_scale (float, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
  • negative_prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide what not to include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt. If it differs from the batch size used for compilation, it is overridden by the static batch size of the compiled Neuron model (except with dynamic batching).
  • eta (float, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the diffusers.schedulers.DDIMScheduler, and is ignored in other schedulers.
  • generator (Optional[Union[torch.Generator, List[torch.Generator]]], defaults to None) — A torch.Generator to make generation deterministic.
  • prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
  • negative_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
  • output_type (Optional[str], defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput instead of a plain tuple.
  • callback (Optional[Callable], defaults to None) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
  • callback_steps (int, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
  • cross_attention_kwargs (dict, defaults to None) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.


diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple

If return_dict is True, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.


>>> from optimum.neuron import NeuronStableDiffusionImg2ImgPipeline
>>> from diffusers.utils import load_image

>>> url = ""
>>> init_image = load_image(url).convert("RGB")

>>> compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
>>> input_shapes = {"batch_size": 1, "height": 512, "width": 512}
>>> pipeline = NeuronStableDiffusionImg2ImgPipeline.from_pretrained(
...     "nitrosocke/Ghibli-Diffusion", export=True, **compiler_args, **input_shapes,
... )
>>> pipeline.save_pretrained("sd_img2img/")

>>> prompt = "ghibli style, a fantasy landscape with snowcapped mountains, trees, lake with detailed reflection."
>>> image = pipeline(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0]
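
Because num_inference_steps is modulated by strength, the call above does not run all 50 default steps. The sketch below shows the truncation rule commonly used by diffusers-style image-to-image pipelines; that the Neuron pipeline follows the same rule is an assumption, and `effective_denoising_steps` is a hypothetical helper.

```python
def effective_denoising_steps(num_inference_steps, strength):
    """Number of denoising steps actually run for a given strength.

    Assumes the schedule is truncated to int(num_inference_steps * strength),
    capped at num_inference_steps.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


steps_example = effective_denoising_steps(50, 0.75)  # the values used above
steps_full = effective_denoising_steps(50, 1.0)      # image is essentially ignored
print(steps_example, steps_full)
```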


class optimum.neuron.NeuronStableDiffusionInpaintPipeline

( text_encoder: ScriptModule unet: ScriptModule vae_decoder: Union config: Dict configs: Dict neuron_configs: Dict tokenizer: CLIPTokenizer scheduler: Union data_parallel_mode: Literal vae_encoder: Union = None text_encoder_2: Union = None tokenizer_2: Optional = None feature_extractor: Optional = None controlnet: Union = None model_save_dir: Union = None model_and_config_save_paths: Optional = None )


__call__

( prompt: Union = None image: Optional = None mask_image: Optional = None masked_image_latents: Optional = None strength: float = 1.0 num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: Union = None num_images_per_prompt: Optional = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None output_type: Optional = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None clip_skip: int = None ) diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple


  • prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
  • image (Optional["PipelineImageInput"], defaults to None) — Image, numpy array or tensor representing an image batch to be inpainted (the parts of the image masked out with mask_image are repainted according to prompt). For both numpy arrays and PyTorch tensors, the expected value range is [0, 1]. If it’s a tensor or a list of tensors, the expected shape should be (B, C, H, W) or (C, H, W). If it is a numpy array or a list of arrays, the expected shape should be (B, H, W, C) or (H, W, C). It can also accept image latents as image, but latents passed directly are not encoded again.
  • mask_image (Optional["PipelineImageInput"], defaults to None) — Image, numpy array or tensor representing an image batch used to mask image. White pixels in the mask are repainted while black pixels are preserved. If mask_image is a PIL image, it is converted to a single channel (luminance) before use. If it’s a numpy array or PyTorch tensor, it should contain one color channel (L) instead of 3, so the expected shape for a PyTorch tensor is (B, 1, H, W), (B, H, W), (1, H, W), or (H, W), and for a numpy array (B, H, W, 1), (B, H, W), (H, W, 1), or (H, W).
  • strength (float, defaults to 1.0) — Indicates extent to transform the reference image. Must be between 0 and 1. image is used as a starting point and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, added noise is maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores image.
  • num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. This parameter is modulated by strength.
  • guidance_scale (float, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
  • negative_prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide what not to include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt. If it differs from the batch size used for compilation, it is overridden by the static batch size of the compiled Neuron model (except with dynamic batching).
  • eta (float, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the diffusers.schedulers.DDIMScheduler, and is ignored in other schedulers.
  • generator (Optional[Union[torch.Generator, List[torch.Generator]]], defaults to None) — A torch.Generator to make generation deterministic.
  • latents (Optional[torch.FloatTensor], defaults to None) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
  • negative_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
  • output_type (Optional[str], defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput instead of a plain tuple.
  • callback (Optional[Callable], defaults to None) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
  • callback_steps (int, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
  • cross_attention_kwargs (dict, defaults to None) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.
  • clip_skip (int, defaults to None) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.


diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple

If return_dict is True, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.
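
The way strength modulates num_inference_steps can be sketched in plain Python. The helper name effective_denoising_steps is hypothetical, and the arithmetic follows the convention commonly used by diffusers image-to-image style pipelines; it is an assumption for illustration, not necessarily the exact logic of the Neuron pipeline.

```python
def effective_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Return how many denoising steps actually run for a given strength (sketch)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"strength must be between 0 and 1, got {strength}")
    # Number of scheduler steps covered by the noise added to the reference image.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # Steps skipped at the start of the schedule because the image is only partly noised.
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


for strength in (0.0, 0.5, 1.0):
    print(strength, effective_denoising_steps(50, strength))
```

Under this assumption, strength=1.0 runs all 50 steps (the reference image is essentially ignored), while strength=0.5 runs only the last 25.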


>>> from optimum.neuron import NeuronStableDiffusionInpaintPipeline
>>> from diffusers.utils import load_image

>>> img_url = ""
>>> mask_url = ""

>>> init_image = load_image(img_url).convert("RGB")
>>> mask_image = load_image(mask_url).convert("RGB")

>>> compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
>>> input_shapes = {"batch_size": 1, "height": 1024, "width": 1024}
>>> pipeline = NeuronStableDiffusionInpaintPipeline.from_pretrained(
...     "runwayml/stable-diffusion-inpainting", export=True, **compiler_args, **input_shapes,
... )
>>> pipeline.save_pretrained("sd_inpaint/")

>>> prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
>>> image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
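
The cadence of the callback and callback_steps parameters documented above can be illustrated without a Neuron device. In this sketch, run_steps is a hypothetical stand-in for the denoising loop and the latents are a plain float placeholder; only the documented callback signature callback(step, timestep, latents) is taken from the parameter description.

```python
from typing import Callable, List, Optional, Tuple


def run_steps(
    timesteps: List[int],
    callback: Optional[Callable[[int, int, float], None]] = None,
    callback_steps: int = 1,
) -> float:
    """Hypothetical stand-in for a denoising loop; only the callback cadence is modeled."""
    latents = 0.0  # placeholder for the latents tensor
    for i, t in enumerate(timesteps):
        latents += 1.0  # placeholder for one denoising update
        # Invoke the callback once every `callback_steps` iterations.
        if callback is not None and i % callback_steps == 0:
            callback(i, t, latents)
    return latents


seen: List[Tuple[int, int]] = []


def log_step(step: int, timestep: int, latents: float) -> None:
    # Record which steps reached the callback.
    seen.append((step, timestep))


run_steps([999, 749, 499, 249], callback=log_step, callback_steps=2)
print(seen)  # [(0, 999), (2, 499)]
```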


class optimum.neuron.NeuronLatentConsistencyModelPipeline

< >

( text_encoder: ScriptModule unet: ScriptModule vae_decoder: Union config: Dict configs: Dict neuron_configs: Dict tokenizer: CLIPTokenizer scheduler: Union data_parallel_mode: Literal vae_encoder: Union = None text_encoder_2: Union = None tokenizer_2: Optional = None feature_extractor: Optional = None controlnet: Union = None model_save_dir: Union = None model_and_config_save_paths: Optional = None )


< >

( prompt: Union = None num_inference_steps: int = 50 original_inference_steps: Optional = None guidance_scale: float = 8.5 num_images_per_prompt: int = 1 generator: Union = None latents: Optional = None prompt_embeds: Optional = None output_type: str = 'pil' return_dict: bool = True cross_attention_kwargs: Optional = None clip_skip: Optional = None callback_on_step_end: Optional = None callback_on_step_end_tensor_inputs: List = ['latents'] ) diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple


  • prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
  • num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • original_inference_steps (Optional[int], defaults to None) — The original number of inference steps used to generate a linearly spaced timestep schedule, from which num_inference_steps evenly spaced timesteps are drawn as the final timestep schedule, following the Skipping-Step method in the paper (see Section 4.3). If not set, this defaults to the scheduler’s original_inference_steps attribute.
  • guidance_scale (float, defaults to 8.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1. Note that the original latent consistency models paper uses a different CFG formulation where the guidance scales are decreased by 1 (so in the paper formulation CFG is enabled when guidance_scale > 0).
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt.
  • generator (Optional[Union[torch.Generator, List[torch.Generator]]], defaults to None) — A torch.Generator to make generation deterministic.
  • latents (Optional[torch.FloatTensor], defaults to None) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
  • output_type (str, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput instead of a plain tuple.
  • cross_attention_kwargs (Optional[Dict[str, Any]], defaults to None) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.
  • clip_skip (Optional[int], defaults to None) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
  • callback_on_step_end (Optional[Callable], defaults to None) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List[str], defaults to ["latents"]) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.


diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple

If return_dict is True, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.
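
The relationship between original_inference_steps and num_inference_steps can be sketched as follows. This is a simplified, pure-Python reading of the Skipping-Step schedule; the helper name lcm_timesteps and the default values of 50 original steps and 1000 training timesteps are assumptions for illustration, and the real scheduler may differ in detail.

```python
from typing import List


def lcm_timesteps(
    num_inference_steps: int,
    original_inference_steps: int = 50,  # assumed scheduler default
    num_train_timesteps: int = 1000,  # assumed training schedule length
) -> List[int]:
    """Pick evenly spaced timesteps from a linearly spaced original schedule (sketch)."""
    if num_inference_steps > original_inference_steps:
        raise ValueError("num_inference_steps cannot exceed original_inference_steps")
    # Linearly spaced original schedule, e.g. 19, 39, ..., 999, then reversed to descend.
    k = num_train_timesteps // original_inference_steps
    original = [i * k - 1 for i in range(1, original_inference_steps + 1)]
    original.reverse()
    # Evenly spaced indices into the original schedule.
    indices = [int(i * len(original) / num_inference_steps) for i in range(num_inference_steps)]
    return [original[j] for j in indices]


print(lcm_timesteps(4))  # [999, 759, 499, 259]
```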



class optimum.neuron.NeuronStableDiffusionControlNetPipeline

< >

( text_encoder: ScriptModule unet: ScriptModule vae_decoder: Union config: Dict configs: Dict neuron_configs: Dict tokenizer: CLIPTokenizer scheduler: Union data_parallel_mode: Literal vae_encoder: Union = None text_encoder_2: Union = None tokenizer_2: Optional = None feature_extractor: Optional = None controlnet: Union = None model_save_dir: Union = None model_and_config_save_paths: Optional = None )


< >

( prompt: Union = None image: Union = None num_inference_steps: int = 50 timesteps: Optional = None sigmas: Optional = None guidance_scale: float = 7.5 negative_prompt: Union = None num_images_per_prompt: Optional = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None ip_adapter_image: Union = None ip_adapter_image_embeds: Optional = None output_type: str = 'pil' return_dict: bool = True cross_attention_kwargs: Optional = None controlnet_conditioning_scale: Union = 1.0 guess_mode: bool = False control_guidance_start: Union = 0.0 control_guidance_end: Union = 1.0 clip_skip: Optional = None callback_on_step_end: Union = None callback_on_step_end_tensor_inputs: List = ['latents'] **kwargs ) diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple


  • prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
  • image (Optional["PipelineImageInput"], defaults to None) — The ControlNet input condition to provide guidance to the unet for generation. If the type is specified as torch.Tensor, it is passed to ControlNet as is. PIL.Image.Image can also be accepted as an image. The dimensions of the output image default to image’s dimensions. If height and/or width are passed, image is resized accordingly. If multiple ControlNets are specified in init, images must be passed as a list such that each element of the list can be correctly batched for input to a single ControlNet. When prompt is a list, and if a list of images is passed for a single ControlNet, each will be paired with each prompt in the prompt list. This also applies to multiple ControlNets, where a list of image lists can be passed to batch for each prompt and each ControlNet.
  • num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • timesteps (Optional[List[int]], defaults to None) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
  • sigmas (Optional[List[int]], defaults to None) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
  • guidance_scale (float, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
  • negative_prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt. If it differs from the batch size used for compilation, it is overridden by the static batch size of the Neuron model (except when dynamic batching is enabled).
  • eta (float, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the diffusers.schedulers.DDIMScheduler, and is ignored in other schedulers.
  • generator (Optional[Union[torch.Generator, List[torch.Generator]]], defaults to None) — A torch.Generator to make generation deterministic.
  • latents (Optional[torch.Tensor], defaults to None) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (Optional[torch.Tensor], defaults to None) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
  • negative_prompt_embeds (Optional[torch.Tensor], defaults to None) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
  • ip_adapter_image (Optional[PipelineImageInput], defaults to None) — Optional image input to work with IP Adapters.
  • ip_adapter_image_embeds (Optional[List[torch.Tensor]], defaults to None) — Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
  • output_type (str, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput instead of a plain tuple.
  • cross_attention_kwargs (Optional[Dict[str, Any]], defaults to None) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.
  • controlnet_conditioning_scale (Union[float, List[float]], defaults to 1.0) — The outputs of the ControlNet are multiplied by controlnet_conditioning_scale before they are added to the residual in the original unet. If multiple ControlNets are specified in init, you can set the corresponding scale as a list.
  • guess_mode (bool, defaults to False) — The ControlNet encoder tries to recognize the content of the input image even if you remove all prompts. A guidance_scale value between 3.0 and 5.0 is recommended.
  • control_guidance_start (Union[float, List[float]], defaults to 0.0) — The percentage of total steps at which the ControlNet starts applying.
  • control_guidance_end (Union[float, List[float]], defaults to 1.0) — The percentage of total steps at which the ControlNet stops applying.
  • clip_skip (Optional[int], defaults to None) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
  • callback_on_step_end (Optional[Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]], defaults to None) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List[str], defaults to ["latents"]) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.


diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple

If return_dict is True, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.
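
The effect of control_guidance_start and control_guidance_end can be sketched as a per-step mask. The helper name controlnet_keep is hypothetical and the rule below is an assumption modeled on the convention used in diffusers ControlNet pipelines, shown here for a single ControlNet.

```python
from typing import List


def controlnet_keep(
    num_steps: int,
    control_guidance_start: float = 0.0,
    control_guidance_end: float = 1.0,
) -> List[float]:
    """Return 1.0 for steps where the ControlNet is applied and 0.0 otherwise (sketch)."""
    keep = []
    for i in range(num_steps):
        # A step is dropped if it begins before the start fraction
        # or finishes after the end fraction of the schedule.
        outside = (i / num_steps < control_guidance_start) or (
            (i + 1) / num_steps > control_guidance_end
        )
        keep.append(0.0 if outside else 1.0)
    return keep


mask = controlnet_keep(10, control_guidance_start=0.2, control_guidance_end=0.8)
print(mask)
```

Under this assumption, the effective conditioning scale at each step would be controlnet_conditioning_scale multiplied by the corresponding mask entry, so the ControlNet contributes only in the middle 60% of this 10-step schedule.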


class optimum.neuron.NeuronStableDiffusionXLPipeline

< >

( text_encoder: ScriptModule unet: ScriptModule vae_decoder: ScriptModule config: Dict tokenizer: CLIPTokenizer scheduler: Union data_parallel_mode: Literal vae_encoder: Optional = None text_encoder_2: Optional = None tokenizer_2: Optional = None feature_extractor: Optional = None controlnet: Union = None configs: Optional = None neuron_configs: Optional = None model_save_dir: Union = None model_and_config_save_paths: Optional = None add_watermarker: Optional = None )


< >

( prompt: Union = None prompt_2: Union = None num_inference_steps: int = 50 denoising_end: Optional = None guidance_scale: float = 5.0 negative_prompt: Union = None negative_prompt_2: Union = None num_images_per_prompt: int = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None pooled_prompt_embeds: Optional = None negative_pooled_prompt_embeds: Optional = None output_type: Optional = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None guidance_rescale: float = 0.0 original_size: Optional = None crops_coords_top_left: Tuple = (0, 0) target_size: Optional = None negative_original_size: Optional = None negative_crops_coords_top_left: Tuple = (0, 0) negative_target_size: Optional = None clip_skip: Optional = None ) diffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput or tuple


  • prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
  • prompt_2 (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to be sent to the tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
  • num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • denoising_end (Optional[float], defaults to None) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally prematurely terminated. As a result, the returned sample will still retain a substantial amount of noise as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in Refining the Image Output.
  • guidance_scale (float, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • negative_prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • negative_prompt_2 (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts not to guide the image generation to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt. If it differs from the batch size used for compilation, it is overridden by the static batch size of the Neuron model (except when dynamic batching is enabled).
  • eta (float, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper. Only applies to diffusers.schedulers.DDIMScheduler and is ignored for other schedulers.
  • generator (Optional[Union[torch.Generator, List[torch.Generator]]], defaults to None) — One or a list of torch generator(s) to make generation deterministic.
  • latents (Optional[torch.FloatTensor], defaults to None) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • pooled_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from prompt input argument.
  • negative_pooled_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from negative_prompt input argument.
  • output_type (Optional[str], defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image or np.array.
  • return_dict (bool, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput instead of a plain tuple.
  • callback (Optional[Callable], defaults to None) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
  • callback_steps (int, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
  • cross_attention_kwargs (dict, defaults to None) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • guidance_rescale (float, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale is defined as φ in equation 16 of Common Diffusion Noise Schedules and Sample Steps are Flawed. The guidance rescale factor should fix overexposure when using zero terminal SNR.
  • original_size (Optional[Tuple[int, int]], defaults to (1024, 1024)) — If original_size is not the same as target_size the image will appear to be down- or upsampled. original_size defaults to (width, height) if not specified. Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • crops_coords_top_left (Tuple[int], defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be “cropped” from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • target_size (Tuple[int], defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified it will default to (width, height). Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • negative_original_size (Tuple[int], defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • negative_crops_coords_top_left (Tuple[int], defaults to (0, 0)) — To negatively condition the generation process based on specific crop coordinates. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • negative_target_size (Tuple[int], defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as the target_size for most cases. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • clip_skip (Optional[int], defaults to None) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.


diffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput or tuple

diffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.
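
How guidance_scale and guidance_rescale combine the unconditional and text-conditioned noise predictions can be sketched on plain lists of floats. The helper name apply_guidance is hypothetical, the inputs are made-up numbers, and the formulas follow the usual classifier-free guidance and rescaling conventions; this is an illustrative assumption, not the pipeline's tensor code.

```python
from statistics import pstdev
from typing import List


def apply_guidance(
    noise_uncond: List[float],
    noise_text: List[float],
    guidance_scale: float = 5.0,
    guidance_rescale: float = 0.0,
) -> List[float]:
    """Classifier-free guidance followed by optional std-based rescaling (sketch)."""
    # Push the prediction away from the unconditional one, toward the text-conditioned one.
    cfg = [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]
    if guidance_rescale > 0.0:
        std_text = pstdev(noise_text)
        std_cfg = pstdev(cfg)
        if std_cfg > 0.0:
            # Rescale so the guided prediction has the same spread as the text prediction,
            # then blend with the unrescaled result using the factor phi = guidance_rescale.
            rescaled = [c * (std_text / std_cfg) for c in cfg]
            cfg = [
                guidance_rescale * r + (1.0 - guidance_rescale) * c
                for r, c in zip(rescaled, cfg)
            ]
    return cfg


uncond = [0.0, 0.1, -0.2, 0.3]  # illustrative values only
text = [0.5, -0.4, 0.2, 0.1]
print(apply_guidance(uncond, text, guidance_scale=5.0))
print(apply_guidance(uncond, text, guidance_scale=5.0, guidance_rescale=0.7))
```

With guidance_scale equal to 1 the result reduces to the text-conditioned prediction, which is why guidance is described as enabled only when guidance_scale > 1.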


>>> from optimum.neuron import NeuronStableDiffusionXLPipeline

>>> compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
>>> input_shapes = {"batch_size": 1, "height": 1024, "width": 1024}

>>> stable_diffusion_xl = NeuronStableDiffusionXLPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-xl-base-1.0", export=True, **compiler_args, **input_shapes
... )
>>> stable_diffusion_xl.save_pretrained("sd_neuron_xl/")

>>> prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
>>> image = stable_diffusion_xl(prompt).images[0]
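
The denoising_end parameter described in the parameter list can be sketched as a cutoff on the scheduler's timesteps. The helper name truncate_timesteps, the 1000 training timesteps, and the evenly spaced schedule are assumptions for illustration, modeled on the convention used in diffusers SDXL pipelines.

```python
from typing import List, Optional


def truncate_timesteps(
    timesteps: List[int],
    denoising_end: Optional[float],
    num_train_timesteps: int = 1000,  # assumed training schedule length
) -> List[int]:
    """Keep only the timesteps that fall before the denoising_end cutoff (sketch)."""
    if denoising_end is None or not 0.0 < denoising_end < 1.0:
        return list(timesteps)
    # Timesteps count down, so completing a fraction of the process means
    # stopping once the timestep drops below this cutoff.
    cutoff = int(round(num_train_timesteps - denoising_end * num_train_timesteps))
    return [t for t in timesteps if t >= cutoff]


timesteps = list(range(980, -1, -20))  # 50 evenly spaced timesteps, descending
first_stage = truncate_timesteps(timesteps, denoising_end=0.8)
print(len(timesteps), len(first_stage))  # 50 40
```

In a “Mixture of Denoisers” setup, the remaining timesteps would be handled by a second pipeline.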


class optimum.neuron.NeuronStableDiffusionXLImg2ImgPipeline

< >

( text_encoder: ScriptModule unet: ScriptModule vae_decoder: ScriptModule config: Dict tokenizer: CLIPTokenizer scheduler: Union data_parallel_mode: Literal vae_encoder: Optional = None text_encoder_2: Optional = None tokenizer_2: Optional = None feature_extractor: Optional = None controlnet: Union = None configs: Optional = None neuron_configs: Optional = None model_save_dir: Union = None model_and_config_save_paths: Optional = None add_watermarker: Optional = None )


< >

( prompt: Union = None prompt_2: Union = None image: Optional = None strength: float = 0.3 num_inference_steps: int = 50 denoising_start: Optional = None denoising_end: Optional = None guidance_scale: float = 5.0 negative_prompt: Union = None negative_prompt_2: Union = None num_images_per_prompt: Optional = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None pooled_prompt_embeds: Optional = None negative_pooled_prompt_embeds: Optional = None output_type: Optional = 'pil' return_dict: bool = True callback: Optional = None callback_steps: int = 1 cross_attention_kwargs: Optional = None guidance_rescale: float = 0.0 original_size: Tuple = None crops_coords_top_left: Tuple = (0, 0) target_size: Tuple = None negative_original_size: Optional = None negative_crops_coords_top_left: Tuple = (0, 0) negative_target_size: Optional = None aesthetic_score: float = 6.0 negative_aesthetic_score: float = 2.5 clip_skip: Optional = None ) diffusers.pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput or tuple


  • prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
  • prompt_2 (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to be sent to the tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
  • image (Optional["PipelineImageInput"], defaults to None) — The image(s) to modify with the pipeline.
  • strength (float, defaults to 0.3) — Conceptually, indicates how much to transform the reference image. Must be between 0 and 1. image will be used as a starting point, adding more noise to it the larger the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, added noise will be maximum and the denoising process will run for the full number of iterations specified in num_inference_steps. A value of 1, therefore, essentially ignores image. Note that when denoising_start is specified, the value of strength is ignored.
  • num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • denoising_start (Optional[float], defaults to None) — When specified, indicates the fraction (between 0.0 and 1.0) of the total denoising process to be bypassed before it is initiated. Consequently, the initial part of the denoising process is skipped and it is assumed that the passed image is a partly denoised image. Note that when this is specified, strength will be ignored. The denoising_start parameter is particularly beneficial when this pipeline is integrated into a “Mixture of Denoisers” multi-pipeline setup, as detailed in Refining the Image Output.
  • denoising_end (Optional[float], defaults to None) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally prematurely terminated. As a result, the returned sample will still retain a substantial amount of noise (ca. final 20% of timesteps still needed) and should be denoised by a successor pipeline that has denoising_start set to 0.8 so that it only denoises the final 20% of the scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in Refining the Image Output.
  • guidance_scale (float, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • negative_prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • negative_prompt_2 (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts not to guide the image generation to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt. If it differs from the batch size used for compilation, it is overridden by the static batch size of the Neuron model (except when dynamic batching is enabled).
  • eta (float, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper. Only applies to diffusers.schedulers.DDIMScheduler and is ignored for other schedulers.
  • generator (Optional[Union[torch.Generator, List[torch.Generator]]], defaults to None) — One or a list of torch generator(s) to make generation deterministic.
  • latents (Optional[torch.FloatTensor], defaults to None) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • pooled_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from prompt input argument.
  • negative_pooled_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from negative_prompt input argument.
  • output_type (Optional[str], defaults to "pil") — The output format of the generated image. Choose between PIL (PIL.Image.Image) and np.array.
  • return_dict (bool, defaults to True) — Whether or not to return a diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput instead of a plain tuple.
  • callback (Optional[Callable], defaults to None) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
  • callback_steps (int, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
  • cross_attention_kwargs (Optional[Dict[str, Any]], defaults to None) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • guidance_rescale (float, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale is defined as φ in equation 16 of Common Diffusion Noise Schedules and Sample Steps are Flawed. The guidance rescale factor should fix overexposure when using zero terminal SNR.
  • original_size (Optional[Tuple[int, int]], defaults to (1024, 1024)) — If original_size is not the same as target_size the image will appear to be down- or upsampled. original_size defaults to (width, height) if not specified. Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • crops_coords_top_left (Tuple[int], defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be “cropped” from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • target_size (Tuple[int], defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified it will default to (width, height). Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • negative_original_size (Tuple[int], defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • negative_crops_coords_top_left (Tuple[int], defaults to (0, 0)) — To negatively condition the generation process based on a specific crop coordinates. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • negative_target_size (Tuple[int], defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as target_size for most cases. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • aesthetic_score (float, defaults to 6.0) — Used to simulate an aesthetic score of the generated image by influencing the positive text condition. Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • negative_aesthetic_score (float, defaults to 2.5) — Part of SDXL’s micro-conditioning as explained in section 2.2 of Can be used to simulate an aesthetic score of the generated image by influencing the negative text condition.
  • clip_skip (Optional[int], defaults to None) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.


diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.
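
The micro-conditioning parameters above (original_size, crops_coords_top_left, target_size, aesthetic_score) can be pictured as one flat list of conditioning values. The following is a minimal sketch, assuming the behavior of the upstream diffusers SDXL pipelines; the helper name build_add_time_ids and the requires_aesthetics_score flag are illustrative assumptions, not part of the optimum.neuron API.

```python
from typing import List, Tuple


def build_add_time_ids(
    original_size: Tuple[int, int],
    crops_coords_top_left: Tuple[int, int],
    target_size: Tuple[int, int],
    aesthetic_score: float = 6.0,
    requires_aesthetics_score: bool = False,
) -> List[float]:
    """Hypothetical helper: flatten micro-conditioning values into one list."""
    if requires_aesthetics_score:
        # Refiner-style conditioning: the aesthetic score takes the place of target_size.
        values = original_size + crops_coords_top_left + (aesthetic_score,)
    else:
        # Base-style conditioning: sizes and crop offsets are concatenated in order.
        values = original_size + crops_coords_top_left + target_size
    return [float(v) for v in values]


base_ids = build_add_time_ids((1024, 1024), (0, 0), (1024, 1024))
refiner_ids = build_add_time_ids(
    (1024, 1024), (0, 0), (1024, 1024), requires_aesthetics_score=True
)
```

The negative_* variants would be assembled the same way from negative_original_size, negative_crops_coords_top_left, and negative_target_size or negative_aesthetic_score.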


>>> from optimum.neuron import NeuronStableDiffusionXLImg2ImgPipeline
>>> from diffusers.utils import load_image

>>> url = ""
>>> init_image = load_image(url).convert("RGB")

>>> compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
>>> input_shapes = {"batch_size": 1, "height": 512, "width": 512}
>>> pipeline = NeuronStableDiffusionXLImg2ImgPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-xl-base-1.0", export=True, **compiler_args, **input_shapes,
... )
>>> pipeline.save_pretrained("sdxl_img2img/")

>>> prompt = "a dog running, lake, moat"
>>> image = pipeline(prompt=prompt, image=init_image).images[0]
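
To make guidance_scale and guidance_rescale more concrete, here is a minimal sketch of classifier-free guidance followed by the optional rescale step. It works on plain Python lists rather than tensors, and the function name apply_guidance is a hypothetical stand-in for illustration; the compiled Neuron pipeline may combine predictions differently.

```python
from statistics import pstdev
from typing import List


def apply_guidance(
    uncond: List[float],
    text: List[float],
    guidance_scale: float = 7.5,
    guidance_rescale: float = 0.0,
) -> List[float]:
    """Hypothetical helper: combine unconditional and text-conditioned predictions."""
    # Classifier-free guidance: start from the unconditional prediction and
    # move toward the text-conditioned one, scaled by guidance_scale.
    guided = [u + guidance_scale * (t - u) for u, t in zip(uncond, text)]
    if guidance_rescale > 0.0:
        std_text = pstdev(text)
        std_guided = pstdev(guided)
        if std_guided > 0.0:
            # Match the spread of the text prediction, then blend by guidance_rescale.
            rescaled = [g * (std_text / std_guided) for g in guided]
            guided = [
                guidance_rescale * r + (1.0 - guidance_rescale) * g
                for r, g in zip(rescaled, guided)
            ]
    return guided


plain = apply_guidance([0.0, 0.0], [1.0, 2.0], guidance_scale=2.0)
rescaled = apply_guidance([0.0, 0.0], [1.0, 2.0], guidance_scale=2.0, guidance_rescale=1.0)
```

With guidance_scale set to 1 the result equals the text-conditioned prediction, which is why guidance only takes effect for values above 1.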


class optimum.neuron.NeuronStableDiffusionXLInpaintPipeline

< >

( text_encoder: ScriptModule unet: ScriptModule vae_decoder: ScriptModule config: Dict tokenizer: CLIPTokenizer scheduler: Union data_parallel_mode: Literal vae_encoder: Optional = None text_encoder_2: Optional = None tokenizer_2: Optional = None feature_extractor: Optional = None controlnet: Union = None configs: Optional = None neuron_configs: Optional = None model_save_dir: Union = None model_and_config_save_paths: Optional = None add_watermarker: Optional = None )


< >

( prompt: Union = None prompt_2: Union = None image: Optional = None mask_image: Optional = None masked_image_latents: Optional = None padding_mask_crop: Optional = None strength: float = 0.9999 num_inference_steps: int = 50 timesteps: Optional = None denoising_start: Optional = None denoising_end: Optional = None guidance_scale: float = 7.5 negative_prompt: Union = None negative_prompt_2: Union = None num_images_per_prompt: Optional = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None pooled_prompt_embeds: Optional = None negative_pooled_prompt_embeds: Optional = None ip_adapter_image: Union = None output_type: Optional = 'pil' return_dict: bool = True cross_attention_kwargs: Optional = None guidance_rescale: float = 0.0 original_size: Tuple = None crops_coords_top_left: Tuple = (0, 0) target_size: Tuple = None negative_original_size: Optional = None negative_crops_coords_top_left: Tuple = (0, 0) negative_target_size: Optional = None aesthetic_score: float = 6.0 negative_aesthetic_score: float = 2.5 clip_skip: Optional = None callback_on_step_end: Optional = None callback_on_step_end_tensor_inputs: List = ['latents'] **kwargs ) diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple


  • prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
  • prompt_2 (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to be sent to the tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text-encoders.
  • image (Optional["PipelineImageInput"], defaults to None) — Image, or tensor representing an image batch which will be inpainted, i.e. parts of the image will be masked out with mask_image and repainted according to prompt.
  • mask_image (Optional["PipelineImageInput"], defaults to None) — Image, or tensor representing an image batch, to mask image. White pixels in the mask will be repainted, while black pixels will be preserved. If mask_image is a PIL image, it will be converted to a single channel (luminance) before use. If it’s a tensor, it should contain one color channel (L) instead of 3, so the expected shape would be (B, H, W, 1).
  • padding_mask_crop (Optional[int], defaults to None) — The size of margin in the crop to be applied to the image and masking. If None, no crop is applied to image and mask_image. If padding_mask_crop is not None, it will first find a rectangular region with the same aspect ratio as the image that contains all of the masked area, and then expand that area based on padding_mask_crop. The image and mask_image will then be cropped based on the expanded area before resizing to the original image size for inpainting. This is useful when the masked area is small while the image is large and contains information irrelevant for inpainting, such as background.
  • strength (float, defaults to 0.9999) — Conceptually, indicates how much to transform the masked portion of the reference image. Must be between 0 and 1. image will be used as a starting point, adding more noise to it the larger the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, added noise will be maximum and the denoising process will run for the full number of iterations specified in num_inference_steps. A value of 1, therefore, essentially ignores the masked portion of the reference image. Note that in the case of denoising_start being declared as an integer, the value of strength will be ignored.
  • num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • timesteps (Optional[List[int]], defaults to None) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
  • denoising_start (Optional[float], defaults to None) — When specified, indicates the fraction (between 0.0 and 1.0) of the total denoising process to be bypassed before it is initiated. Consequently, the initial part of the denoising process is skipped and it is assumed that the passed image is a partly denoised image. Note that when this is specified, strength will be ignored. The denoising_start parameter is particularly beneficial when this pipeline is integrated into a “Mixture of Denoisers” multi-pipeline setup, as detailed in Refining the Image Output.
  • denoising_end (Optional[float], defaults to None) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally prematurely terminated. As a result, the returned sample will still retain a substantial amount of noise (ca. final 20% of timesteps still needed) and should be denoised by a successor pipeline that has denoising_start set to 0.8 so that it only denoises the final 20% of the scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in Refining the Image Output.
  • guidance_scale (float, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • negative_prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • negative_prompt_2 (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts not to guide the image generation to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text-encoders.
  • prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • pooled_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from prompt input argument.
  • negative_pooled_prompt_embeds (Optional[torch.FloatTensor], defaults to None) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from negative_prompt input argument.
  • ip_adapter_image (Optional[PipelineImageInput], defaults to None) — Optional image input to work with IP Adapters.
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt.
  • eta (float, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper. Only applies to schedulers.DDIMScheduler, will be ignored for others.
  • generator (Optional[Union[torch.Generator, List[torch.Generator]]], defaults to None) — One or a list of torch generator(s) to make generation deterministic.
  • latents (Optional[torch.FloatTensor], defaults to None) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • output_type (Optional[str], defaults to "pil") — The output format of the generated image. Choose between PIL (PIL.Image.Image) and np.array.
  • return_dict (bool, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion.StableDiffusionPipelineOutput instead of a plain tuple.
  • cross_attention_kwargs (Optional[Dict[str, Any]], defaults to None) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • original_size (Tuple[int], defaults to (1024, 1024)) — If original_size is not the same as target_size the image will appear to be down- or upsampled. original_size defaults to (height, width) if not specified. Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • crops_coords_top_left (Tuple[int], defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be “cropped” from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • target_size (Tuple[int], defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified it will default to (height, width). Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • negative_original_size (Tuple[int], defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • negative_crops_coords_top_left (Tuple[int], defaults to (0, 0)) — To negatively condition the generation process based on a specific crop coordinates. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • negative_target_size (Tuple[int], defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as target_size for most cases. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • aesthetic_score (float, defaults to 6.0) — Used to simulate an aesthetic score of the generated image by influencing the positive text condition. Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • negative_aesthetic_score (float, defaults to 2.5) — Part of SDXL’s micro-conditioning as explained in section 2.2 of Can be used to simulate an aesthetic score of the generated image by influencing the negative text condition.
  • clip_skip (Optional[int], defaults to None) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
  • callback_on_step_end (Optional[Callable[[int, int, Dict], None]], defaults to None) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List[str], defaults to ["latents"]) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.


diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

diffusers.pipelines.stable_diffusion.StableDiffusionXLPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.
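
The callback_on_step_end contract described above can be sketched without a Neuron device. In the example below, log_step is a callback with the documented argument order, and run_fake_denoising_loop is a hypothetical stand-in for the pipeline's denoising loop, written only to show when and how the callback is called; the way the returned dictionary is read back is an assumption based on upstream diffusers.

```python
from typing import Any, Callable, Dict, List, Sequence

seen_steps: List[int] = []


def log_step(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record each step index and hand the tensors back unchanged."""
    seen_steps.append(step)
    return callback_kwargs


def run_fake_denoising_loop(
    timesteps: Sequence[int],
    callback: Callable[[Any, int, int, Dict[str, Any]], Dict[str, Any]],
    tensor_inputs: Sequence[str] = ("latents",),
) -> List[float]:
    """Hypothetical stand-in for a pipeline loop; no real denoising happens here."""
    latents = [0.0]
    for step, timestep in enumerate(timesteps):
        latents = [x + 1.0 for x in latents]  # placeholder for one denoising step
        local_state = {"latents": latents}
        # Only the names listed in tensor_inputs are exposed to the callback.
        callback_kwargs = {name: local_state[name] for name in tensor_inputs}
        outputs = callback(None, step, timestep, callback_kwargs)
        # Values returned by the callback replace the loop's own copies.
        latents = outputs.pop("latents", latents)
    return latents


final_latents = run_fake_denoising_loop([999, 666, 333], log_step)
```

With a real pipeline, the same function would be passed as callback_on_step_end=log_step.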


>>> from optimum.neuron import NeuronStableDiffusionXLInpaintPipeline
>>> from diffusers.utils import load_image

>>> img_url = ""
>>> mask_url = ""

>>> init_image = load_image(img_url).convert("RGB")
>>> mask_image = load_image(mask_url).convert("RGB")

>>> compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
>>> input_shapes = {"batch_size": 1, "height": 1024, "width": 1024}
>>> pipeline = NeuronStableDiffusionXLInpaintPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-xl-base-1.0", export=True, **compiler_args, **input_shapes,
... )
>>> pipeline.save_pretrained("sdxl_inpaint/")

>>> prompt = "A deep sea diver floating"
>>> image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0]
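
The interaction between strength, num_inference_steps, and denoising_start can be illustrated with a small sketch. It assumes the step-selection logic used by upstream diffusers image-to-image and inpainting pipelines; select_timesteps, the num_train_timesteps default of 1000, and the made-up schedule are assumptions for illustration only.

```python
from typing import List, Optional


def select_timesteps(
    timesteps: List[int],
    strength: float,
    denoising_start: Optional[float] = None,
    num_train_timesteps: int = 1000,
) -> List[int]:
    """Hypothetical helper: pick the scheduler timesteps that actually run."""
    if denoising_start is not None:
        # strength is ignored: keep only the timesteps below the cutoff.
        cutoff = int(round(num_train_timesteps - denoising_start * num_train_timesteps))
        return [t for t in timesteps if t < cutoff]
    num_inference_steps = len(timesteps)
    # A larger strength adds more noise, so more of the schedule has to run.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return timesteps[t_start:]


schedule = list(range(980, -1, -20))  # 50 descending timesteps, made up for the example
partial = select_timesteps(schedule, strength=0.85)
full = select_timesteps(schedule, strength=1.0)
refined = select_timesteps(schedule, strength=0.3, denoising_start=0.8)
```

Under these assumptions, strength=0.85 with 50 steps runs 42 denoising steps, while denoising_start=0.8 keeps only the last 10 regardless of strength.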


class optimum.neuron.NeuronStableDiffusionXLControlNetPipeline

< >

( text_encoder: ScriptModule unet: ScriptModule vae_decoder: ScriptModule config: Dict tokenizer: CLIPTokenizer scheduler: Union data_parallel_mode: Literal vae_encoder: Optional = None text_encoder_2: Optional = None tokenizer_2: Optional = None feature_extractor: Optional = None controlnet: Union = None configs: Optional = None neuron_configs: Optional = None model_save_dir: Union = None model_and_config_save_paths: Optional = None add_watermarker: Optional = None )


< >

( prompt: Union = None prompt_2: Union = None image: Union = None num_inference_steps: int = 50 timesteps: List = None sigmas: List = None denoising_end: Optional = None guidance_scale: float = 5.0 negative_prompt: Union = None negative_prompt_2: Union = None num_images_per_prompt: Optional = 1 eta: float = 0.0 generator: Union = None latents: Optional = None prompt_embeds: Optional = None negative_prompt_embeds: Optional = None pooled_prompt_embeds: Optional = None negative_pooled_prompt_embeds: Optional = None ip_adapter_image: Union = None ip_adapter_image_embeds: Optional = None output_type: Optional = 'pil' return_dict: bool = True cross_attention_kwargs: Optional = None controlnet_conditioning_scale: Union = 1.0 guess_mode: bool = False control_guidance_start: Union = 0.0 control_guidance_end: Union = 1.0 original_size: Optional = None crops_coords_top_left: Tuple = (0, 0) target_size: Optional = None negative_original_size: Optional = None negative_crops_coords_top_left: Tuple = (0, 0) negative_target_size: Optional = None clip_skip: Optional = None callback_on_step_end: Union = None callback_on_step_end_tensor_inputs: List = ['latents'] **kwargs ) diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple


  • prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
  • prompt_2 (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text-encoders.
  • image (Optional["PipelineImageInput"], defaults to None) — The ControlNet input condition to provide guidance to the unet for generation. If the type is specified as torch.Tensor, it is passed to ControlNet as is. PIL.Image.Image can also be accepted as an image. The dimensions of the output image defaults to image’s dimensions. If height and/or width are passed, image is resized accordingly. If multiple ControlNets are specified in init, images must be passed as a list such that each element of the list can be correctly batched for input to a single ControlNet.
  • num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • timesteps (Optional[List[int]], defaults to None) — Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
  • sigmas (Optional[List[int]], defaults to None) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
  • denoising_end (Optional[float], defaults to None) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally prematurely terminated. As a result, the returned sample will still retain a substantial amount of noise as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in Refining the Image Output.
  • guidance_scale (float, defaults to 5.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
  • negative_prompt (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
  • negative_prompt_2 (Optional[Union[str, List[str]]], defaults to None) — The prompt or prompts to guide what to not include in image generation. This is sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text-encoders.
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt.
  • eta (float, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the diffusers.schedulers.DDIMScheduler, and is ignored in other schedulers.
  • generator (Optional[Union[torch.Generator, List[torch.Generator]]], defaults to None) — A torch.Generator to make generation deterministic.
  • latents (Optional[torch.Tensor], defaults to None) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
  • prompt_embeds (Optional[torch.Tensor], defaults to None) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
  • negative_prompt_embeds (Optional[torch.Tensor], defaults to None) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
  • pooled_prompt_embeds (Optional[torch.Tensor], defaults to None) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, pooled text embeddings are generated from prompt input argument.
  • negative_pooled_prompt_embeds (Optional[torch.Tensor], defaults to None) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, pooled negative_prompt_embeds are generated from negative_prompt input argument.
  • ip_adapter_image (Optional[PipelineImageInput], defaults to None) — Optional image input to work with IP Adapters.
  • ip_adapter_image_embeds (Optional[List[torch.Tensor]], defaults to None) — Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
  • output_type (Optional[str], defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
  • return_dict (bool, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion.StableDiffusionPipelineOutput instead of a plain tuple.
  • cross_attention_kwargs (Optional[Dict[str, Any]], defaults to None) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.
  • controlnet_conditioning_scale (Union[float, List[float]], defaults to 1.0) — The outputs of the ControlNet are multiplied by controlnet_conditioning_scale before they are added to the residual in the original unet. If multiple ControlNets are specified in init, you can set the corresponding scale as a list.
  • guess_mode (bool, defaults to False) — The ControlNet encoder tries to recognize the content of the input image even if you remove all prompts. A guidance_scale value between 3.0 and 5.0 is recommended.
  • control_guidance_start (Union[float, List[float]], defaults to 0.0) — The percentage of total steps at which the ControlNet starts applying.
  • control_guidance_end (Union[float, List[float]], defaults to 1.0) — The percentage of total steps at which the ControlNet stops applying.
  • original_size (Optional[Tuple[int, int]], defaults to (1024, 1024)) — If original_size is not the same as target_size the image will appear to be down- or upsampled. original_size defaults to (height, width) if not specified. Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • crops_coords_top_left (Tuple[int, int], defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be “cropped” from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • target_size (Optional[Tuple[int, int]], defaults to None) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified it will default to (height, width). Part of SDXL’s micro-conditioning as explained in section 2.2 of
  • negative_original_size (Optional[Tuple[int, int]], defaults to None) — To negatively condition the generation process based on a specific image resolution. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • negative_crops_coords_top_left (Tuple[int, int], defaults to (0, 0)) — To negatively condition the generation process based on a specific crop coordinates. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • negative_target_size (Optional[Tuple[int, int]], defaults to None) — To negatively condition the generation process based on a target image resolution. It should be the same as target_size for most cases. Part of SDXL’s micro-conditioning as explained in section 2.2 of For more information, refer to this issue thread:
  • clip_skip (Optional[int], defaults to None) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
  • callback_on_step_end (Optional[Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]], defaults to None) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List[str], defaults to ["latents"]) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.


diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput or tuple

If return_dict is True, diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput is returned, otherwise a tuple is returned containing the output images.

The call function to the pipeline for generation.
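
The control_guidance_start and control_guidance_end parameters define a window of steps in which the ControlNet output is applied. The sketch below shows one way such a window could be turned into a per-step multiplier for controlnet_conditioning_scale, assuming the convention used by upstream diffusers; controlnet_keep_schedule is a hypothetical helper, not a function exposed by optimum.neuron.

```python
from typing import List


def controlnet_keep_schedule(
    num_steps: int,
    control_guidance_start: float = 0.0,
    control_guidance_end: float = 1.0,
) -> List[float]:
    """Hypothetical helper: 1.0 where the ControlNet applies, 0.0 elsewhere."""
    keep = []
    for i in range(num_steps):
        # A step is outside the window if it begins before the start fraction
        # or finishes after the end fraction of the schedule.
        outside = (i / num_steps < control_guidance_start) or (
            (i + 1) / num_steps > control_guidance_end
        )
        keep.append(0.0 if outside else 1.0)
    return keep


controlnet_conditioning_scale = 1.0
keep = controlnet_keep_schedule(10, control_guidance_start=0.2, control_guidance_end=0.8)
# Effective scale at each step: the conditioning scale inside the window, zero outside.
effective_scales = [controlnet_conditioning_scale * k for k in keep]
```

With the default window of 0.0 to 1.0, every step keeps the full conditioning scale.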
