Using Optimum Neuron on Amazon SageMaker
Optimum Neuron is integrated into Amazon SageMaker through the Hugging Face Deep Learning Containers (DLCs) for AWS accelerators such as Inferentia2 and Trainium1. This lets you train and deploy 🤗 Transformers and Diffusers models on Amazon SageMaker while leveraging AWS accelerators.
The Hugging Face DLC images come with Optimum Neuron pre-installed, along with the tools needed to compile models for efficient inference on Inferentia2 and Trainium1. This makes deploying large transformer models simple and optimized out of the box.
Below is a list of end-to-end tutorials on using Optimum Neuron via the Hugging Face DLCs to train and deploy models on Amazon SageMaker. Follow them to learn how Optimum Neuron integrates with SageMaker to unlock performance and cost benefits.
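The tutorials below all follow the same basic pattern with the SageMaker Python SDK: pick a Hugging Face DLC, configure the model, and deploy it to an Inferentia2 instance. A minimal sketch is shown here; the model ID, NeuronCore count, and batch/sequence settings are illustrative assumptions, not values from this page.

```python
# Sketch: deploying a Hub model to an Inferentia2 endpoint with the
# SageMaker Python SDK. Model ID, core count, and shape settings below
# are illustrative assumptions, not values from this page.

# On Inferentia2, input shapes are fixed at compile time, so the
# container environment pins batch size and sequence length.
neuron_env = {
    "HF_MODEL_ID": "meta-llama/Llama-2-7b-chat-hf",  # example model
    "HF_NUM_CORES": "2",          # NeuronCores available on inf2.xlarge
    "HF_BATCH_SIZE": "4",
    "HF_SEQUENCE_LENGTH": "4096",
    "HF_AUTO_CAST_TYPE": "fp16",
}

def build_model(role, env=neuron_env):
    """Create a HuggingFaceModel backed by the Neuronx DLC (needs AWS credentials)."""
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    image_uri = get_huggingface_llm_image_uri("huggingface-neuronx")
    return HuggingFaceModel(image_uri=image_uri, env=env, role=role)

# In a real session you would then deploy to an Inferentia2 instance:
# predictor = build_model(role="arn:aws:iam::<account>:role/SageMakerRole").deploy(
#     initial_instance_count=1,
#     instance_type="ml.inf2.xlarge",
# )
```

Each tutorial linked below walks through a concrete, tested variant of this flow for its specific model.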
Deploy Embedding Models on Inferentia2 for Efficient Similarity Search
Tutorial on how to deploy a text embedding model (BGE-Base) for efficient and fast embedding generation on AWS Inferentia2 using Amazon SageMaker. The post shows how Inferentia2 can be not only an efficient and fast but also a cost-effective option for embedding inference compared to GPUs or services like OpenAI and Amazon Bedrock.
Deploy Llama 2 7B on AWS Inferentia2 with Amazon SageMaker
Tutorial on how to deploy the conversational Llama 2 7B model on AWS Inferentia2 using Amazon SageMaker for low-latency inference. It shows how to leverage Inferentia2 and SageMaker to go from model training to production deployment with just a few lines of code.
Deploy Stable Diffusion XL on AWS Inferentia2 with Amazon SageMaker
Tutorial on how to deploy the Stable Diffusion XL model on AWS Inferentia2 using Optimum Neuron and Amazon SageMaker for efficient 1024x1024 image generation at roughly 6 seconds per image. The post shows how a single inf2.xlarge instance costing $0.99/hour can generate about 10 images per minute, making Inferentia2 not only an efficient and fast but also a cost-effective option for image generation compared to GPUs.
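The throughput and cost figures quoted above can be sanity-checked with quick arithmetic, using only the numbers from the tutorial summary:

```python
# Sanity check of the SDXL throughput/cost figures quoted above.
price_per_hour = 0.99      # USD per hour, inf2.xlarge (as quoted)
seconds_per_image = 6.0    # ~6 s per 1024x1024 image (as quoted)

images_per_minute = 60 / seconds_per_image
images_per_hour = 3600 / seconds_per_image
cost_per_1000 = 1000 * price_per_hour / images_per_hour

print(f"{images_per_minute:.0f} images/min, ${cost_per_1000:.2f} per 1,000 images")
# → 10 images/min, $1.65 per 1,000 images
```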
Deploy BERT for Text Classification on AWS Inferentia2 with Amazon SageMaker
Tutorial on how to optimize and deploy a BERT model on AWS Inferentia2 using Optimum Neuron and Amazon SageMaker for efficient text classification with 4 ms latency. The post shows how a single inf2.xlarge instance costing $0.99/hour can achieve 116 inferences per second end to end (500 inferences per second without network overhead), making Inferentia2 a low-latency and cost-effective option compared to GPUs.