Optimum documentation

Inference

Optimum Intel can be used to load models from the Hub and create pipelines to run inference with IPEX optimizations (including patching with custom operators, weight prepacking and graph mode) on a variety of Intel processors. For now, support is only enabled for CPUs.

Loading

You can load your model and apply IPEX optimizations (including weight prepacking and graph mode). For supported architectures such as LLaMA, BERT and ViT, further optimizations are applied by patching the model to use custom operators. For now, support is only enabled for CPUs, and the original model is exported via TorchScript. In the future, torch.compile will be used and the TorchScript export will be deprecated.

  import torch
  from transformers import AutoTokenizer, pipeline
- from transformers import AutoModelForCausalLM
+ from optimum.intel import IPEXModelForCausalLM

  model_id = "gpt2"
- model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+ model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, export=True)
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
  results = pipe("He's a dreadful magician and")
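
Once exported, the optimized model can be saved locally and reloaded later without re-exporting it. The following is a minimal sketch, assuming the usual save_pretrained/from_pretrained workflow and using an example directory name:

  from optimum.intel import IPEXModelForCausalLM

  # Save the IPEX-optimized model and tokenizer to a local directory (the path is only an example)
  model.save_pretrained("./gpt2-ipex")
  tokenizer.save_pretrained("./gpt2-ipex")

  # Reload the saved model; export=True is not needed since it was already exported
  model = IPEXModelForCausalLM.from_pretrained("./gpt2-ipex")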

As shown in the table below, each task is associated with a class that automatically loads your model; a question-answering example follows the table.

Auto Class                            Task
IPEXModelForSequenceClassification    text-classification
IPEXModelForTokenClassification       token-classification
IPEXModelForQuestionAnswering         question-answering
IPEXModelForImageClassification       image-classification
IPEXModel                             feature-extraction
IPEXModelForMaskedLM                  fill-mask
IPEXModelForAudioClassification       audio-classification
IPEXModelForCausalLM                  text-generation
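
For example, question answering follows the same pattern as the text-generation snippet above. This is a minimal sketch; the checkpoint name is only an illustrative choice:

  from transformers import AutoTokenizer, pipeline
  from optimum.intel import IPEXModelForQuestionAnswering

  model_id = "distilbert-base-uncased-distilled-squad"  # example checkpoint
  # export=True applies IPEX optimizations and exports the model via TorchScript
  model = IPEXModelForQuestionAnswering.from_pretrained(model_id, export=True)
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
  results = qa_pipe(
      question="Which processors are supported?",
      context="For now, support is only enabled for CPUs.",
  )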