Fine Tuning a LLM Using Kubernetes with Intel® Xeon® Scalable Processors
Introduction
Large language models (LLMs) used for text generation have exploded in popularity, but it's no secret that a massive amount of compute power is required to train these models due to their large model and dataset sizes. Previously, we have seen how efficient multiple CPUs can be when paired with software optimizations such as Intel® Extension for PyTorch and Intel® oneCCL Bindings for PyTorch. Kubernetes (or K8s for short) orchestrates containerized workloads across a cluster of nodes. In this blog, we take a deep dive into the process of fine tuning meta-llama/Llama-2-7b-hf using the medalpaca/medical_meadow_medical_flashcards dataset on multiple Intel® Xeon® Scalable processor nodes in a K8s cluster.
Why Kubernetes?
K8s orchestration reminds me of Chef Ramsay yelling out orders on Hell's Kitchen -- the teams work together to fulfill the orders as they come in. Some orders go to the red team, and others go to the blue team. Something simple like a salad might be a quick one-person job, but a complicated entree might involve multiple people. Similarly, K8s has a scheduler that assigns jobs to nodes in the cluster based on which nodes are available and the resources required for the application. At any given time, a single node might be running several different jobs, or it could be running a single job that consumes all of its resources. If all of the nodes in a cluster are busy, a job stays pending while it waits for node resources to become available.
The handling of unexpected issues maps to K8s as well. If someone is sent to the medic after burning their hand, the rest of the team continues to work on orders; similarly, the rest of the K8s cluster continues to function if a node goes down. And if someone in the kitchen overcooks a steak, they'll need to redo it; if there's a failure in a K8s pod, it gets restarted.
K8s becomes especially useful when there are a lot of jobs being deployed to the cluster (you don't need Chef Ramsay when you're at home cooking mac and cheese). This could be a team of engineers sharing a cluster of nodes, a bunch of multi-node experiments to run, or a production cluster handling numerous requests. In all of these cases, the Kubernetes control plane manages the cluster resources and coordinates which node will be used to run each pod.
Components
In this tutorial, we will fine tune Llama 2 with a Hugging Face dataset using multiple CPU nodes. Several different components are involved in running this job on the cluster. The diagram below visualizes the interactions between the components.
Helm Chart
The first component that we're going to talk about is the Helm chart. This is kind of like the recipe for our job. It brings together all of the different components that are used for the job and allows us to deploy everything using one helm install command. The K8s resources used in our example are:
- PyTorchJob with multiple workers for fine tuning the model
- Persistent Volume Claim (PVC) used as a shared storage location among the workers for dataset and model files
- Secret for gated models (Optional)
- Data access pod (Optional)
The Helm chart has a values.yaml file with parameters that are used in the spec files for the K8s resources. Our values file includes parameters such as the name for our K8s resources, the image/tag for the container to use for the worker pods, the number of workers, CPU and memory resources, the arguments for the python script, etc. The values get filled into the K8s spec files when the Helm chart is deployed, producing a PyTorchJob along the lines of the sketch below.
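This is a minimal sketch of the kind of PyTorchJob the chart renders, assuming the standard Kubeflow PyTorchJob schema. The resource names, script path, and command are illustrative placeholders, not the chart's actual templates.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: medical-meadow-pytorchjob   # set from the values file
spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 1
    maxReplicas: 4
    maxRestarts: 30
  pytorchReplicaSpecs:
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch          # the training operator expects this container name
              image: intel/ai-workflows:torch-2.2.0-huggingface-multinode-py3.10
              command: ["python", "/workspace/scripts/finetune.py"]   # script path is hypothetical
              resources:
                requests: {cpu: 200, memory: 226Gi}
                limits: {cpu: 200, memory: 226Gi}
              volumeMounts:
                - name: pvc-volume
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: pvc-volume
              persistentVolumeClaim:
                claimName: medical-meadow-pvc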
Container
K8s runs jobs in a containerized environment, so the next thing that we're going to need is a Docker container. You can think of this like the mixing bowl. In our mixing bowl, we need to include all of the dependencies needed for our training job. For optimal performance on Intel Xeon processors, we recommend including Intel® Extension for PyTorch and Intel® oneCCL Bindings for PyTorch.
For this example, we will be using the intel/ai-workflows:torch-2.2.0-huggingface-multinode-py3.10 image from Docker Hub, which includes the following packages:
| Package Name | Version | Purpose |
|---|---|---|
| PyTorch | 2.2.0 | Base framework to train models |
| 🤗 Transformers | 4.39.3 | Library used to download and fine tune the Hugging Face model |
| Intel® Extension for PyTorch | 2.2.0 | Extends PyTorch to provide an extra performance boost on Intel® hardware |
| Intel® oneAPI Collective Communications Library | 2.2.0 | Deploys PyTorch jobs on multiple nodes |
Other key things included in the Dockerfile are MKL, google-perftools, 🤗 PEFT, 🤗 Datasets, and OpenSSH to allow the Intel oneAPI CCL to communicate between containers.
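If you want to build a similar image yourself, the sketch below shows one way to pull the same dependencies together. This is not the recipe for the published image; the base image, package pins, and wheel index URLs are assumptions, so check the Intel® Extension for PyTorch installation docs for current instructions.

# Minimal sketch of a comparable training image (not the published Dockerfile)
FROM python:3.10-slim

# System packages: tcmalloc for performance, OpenSSH so oneCCL can reach the other workers
RUN apt-get update && \
    apt-get install -y --no-install-recommends google-perftools openssh-server openssh-client && \
    rm -rf /var/lib/apt/lists/*

# CPU-only PyTorch, Intel optimizations, and the Hugging Face libraries
RUN pip install --no-cache-dir torch==2.2.0 --index-url https://download.pytorch.org/whl/cpu && \
    pip install --no-cache-dir intel-extension-for-pytorch==2.2.0 oneccl_bind_pt==2.2.0 \
        --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/ && \
    pip install --no-cache-dir transformers==4.39.3 datasets peft

# Add the fine tuning script (path is illustrative)
COPY scripts /workspace/scripts
WORKDIR /workspace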
Fine Tuning Script
The python script that we are using fine tunes a causal language model for text generation. It has arguments similar to what you'd see in the Hugging Face example scripts. All parameters in the script are the same as those in the values.yaml file, just converted to camelCase to match the Helm naming convention.
This script can be used to fine tune a chatbot or for instruction tuning.
Chat-based models will use the following prompt format:
<s>[INST] <<SYS>>
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.<</SYS>>
Calculate the median of the following list of integers. 6, 5, 8, 1, 2, 1, 7 [/INST] 5 </s>
Other models will be instruction tuned using the following prompt format:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Calculate the median of the following list of integers.
### Input:
6, 5, 8, 1, 2, 1, 7
### Response:
5
The prompt strings can also be customized with script parameters in order to provide a prompt that is more appropriate for your job. For example, you might want to provide a more targeted prompt, such as "You are a helpful, respectful and honest finance assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
The python script needs to be included in the Docker container, so let's add that to our mixing bowl. Our Dockerfile does a COPY to add the scripts to the image.
Storage
We need a storage location that can be shared among the workers to access the dataset and save model files. We are using a vanilla K8s cluster with an NFS-backed storage class. If you are using a cloud service provider, you could use a cloud storage bucket instead.
Thinking back to our cooking analogy, storage can be compared to the pantry or fridge. You wouldn't add the whole pantry into the mixing bowl, and similarly, our NFS storage location doesn't get added to the container. Instead, the storage location gets mounted into the container so that we have access to read and write from that location without it being built into the image. To achieve this, we are using a persistent volume claim (PVC).
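For reference, a PVC for this kind of shared storage might look like the sketch below, assuming a ReadWriteMany access mode so that every worker can mount it. The resource name is a placeholder; the storage class and size come from the values file.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: medical-meadow-pvc        # placeholder; the chart sets this from values.yaml
spec:
  accessModes:
    - ReadWriteMany               # shared read/write across all of the worker pods
  storageClassName: nfs-client
  resources:
    requests:
      storage: 50Gi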
Secret
The last ingredient that we're adding in is the secret sauce. Gated or private models require you to be logged in to download the model. For authentication from the K8s job, we define a secret with a Hugging Face User Read Only Access Token. The token from the secret will be mounted into the container. If the model being trained is not gated or private, this isn't required.
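A rough sketch of the kind of Secret the chart creates is shown below. The resource name and data key are assumptions; the value is the base64-encoded token from the values file.

apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret           # placeholder; the chart sets this from values.yaml
type: Opaque
data:
  token: aGZfQUJDREVGRwo=         # base64-encoded Hugging Face read-only token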
Cluster Requirements
This tutorial requires Kubeflow to be installed on your cluster. Kubeflow provides features and custom resources that simplify running and scaling machine learning workloads on K8s clusters. In this example, we are going to be using the PyTorch training operator from Kubeflow. The PyTorch training operator allows us to run distributed PyTorch training jobs on the cluster without needing to manually set environment variables such as MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE. After Kubeflow has been installed, verify that the PyTorchJob custom resource has been successfully deployed to your cluster using kubectl get crd pytorchjobs.kubeflow.org and ensure that the output is similar to:
NAME CREATED AT
pytorchjobs.kubeflow.org 2023-03-24T15:42:17Z
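If the custom resource is missing, the standalone Kubeflow training operator can typically be installed with a kustomize apply like the one below. The release tag is an assumption, so check the Kubeflow Training Operator documentation for the version that matches your cluster.

# Install the standalone training operator (release tag shown is illustrative)
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"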
Our cluster uses 4th Gen Intel® Xeon® Scalable processors in order to take advantage of Intel® Advanced Matrix Extensions (Intel® AMX) and bfloat16. If role-based access control (RBAC) is enabled on your cluster, listing nodes and many other cluster-wide commands require specific roles to be granted to the user. The kubectl auth can-i get nodes command will return "yes" if you are able to list the nodes with kubectl get nodes, for example:
NAME STATUS ROLES AGE VERSION
k8s-spr-01 Ready worker 69d v1.22.17
k8s-spr-02 Ready worker 68d v1.22.17
k8s-spr-03 Ready worker 65d v1.22.17
k8s-spr-04 Ready worker 65d v1.22.17
Otherwise, consult your cluster admin to get a list of the nodes available to your user group.
Once you know the names of the nodes, use kubectl describe node <node name> to get its CPU and memory capacity. We will be using this information later when setting up the specification for the worker pods.
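For example, either of the commands below will surface the capacity numbers; the node name is taken from the listing above.

# Show the Capacity/Allocatable section of the node description
kubectl describe node k8s-spr-01 | grep -A 6 "Capacity"

# Or pull just the CPU and memory capacity fields
kubectl get node k8s-spr-01 -o jsonpath='{.status.capacity.cpu} {.status.capacity.memory}{"\n"}'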
Tutorial: Fine Tuning Llama 2 using a Kubernetes Cluster
Client Requirements:
- kubectl installed and configured to connect to your cluster
- Helm
- Download and extract the Helm chart used in this tutorial:

wget https://storage.googleapis.com/public-artifacts/helm_charts/tlt_v0.7.0_hf_helm_chart.tar.gz
tar -xvzf tlt_v0.7.0_hf_helm_chart.tar.gz

This tar file includes 3 different examples for the Helm values file:
- hf_k8s/chart/values.yaml: A template for running your own workload
- hf_k8s/chart/medical_meadow_values.yaml: Fine tune Llama 2 using the medalpaca/medical_meadow_medical_flashcards dataset
- hf_k8s/chart/financial_chatbot_values.yaml: Fine tune a Llama 2 chatbot using a subset of a finance dataset

Select one of these to use as your values file for the instructions below.
Step 1: Set up the secret with your Hugging Face token
Get a Hugging Face token with read access and use your terminal to get the base64 encoding for your token using echo <your token> | base64.
For example:
$ echo hf_ABCDEFG | base64
aGZfQUJDREVGRwo=
Copy and paste the encoded token value into the encodedToken field in the secret section of your values yaml file.
For example, to run the Medical Meadow fine tuning job, open the hf_k8s/chart/medical_meadow_values.yaml file and paste in your encoded token on line 23:
secret:
encodedToken: aGZfQUJDREVGRwo=
Step 2: Customize your values.yaml parameters
The hf_k8s/chart/medical_meadow_values.yaml file is set up to fine tune meta-llama/Llama-2-7b-hf using the medalpaca/medical_meadow_medical_flashcards dataset. If you are using the hf_k8s/chart/values.yaml template, fill in either a datasetName to use a Hugging Face dataset, or provide a dataFile path.
The distributed.train section of the file can be changed to adjust the training job's dataset, epochs, max steps, learning rate, LoRA config, bfloat16 setting, etc., as sketched below.
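As a rough illustration only, the section looks something like the snippet below. The key names here are hypothetical stand-ins, so refer to the comments in the values file itself for the exact parameter names.

distributed:
  train:
    datasetName: medalpaca/medical_meadow_medical_flashcards   # or a dataFile path
    epochs: 3                 # the key names below are hypothetical examples
    maxSteps: -1
    learningRate: 2e-4
    useLora: true
    loraRank: 8
    loraAlpha: 32
    useBf16: true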
The values files also have parameters for setting the pod's security context with your user and group ids to allow running the fine tuning script as a non-root user. If the user and group ids aren't set, the job will run as root.
securityContext:
runAsUser:
runAsGroup:
fsGroup:
allowPrivilegeEscalation: false
There are other parameters in the values.yaml file that need to be configured based on your cluster:
elasticPolicy:
rdzvBackend: c10d
minReplicas: 1
maxReplicas: 4 # Must be greater than or equal to the number of distributed.workers
maxRestarts: 30
distributed:
workers: 4
...
# Resources allocated to each worker
resources:
cpuRequest: 200 # Update based on your hardware config
cpuLimit: 200
memoryRequest: 226Gi
memoryLimit: 226Gi
nodeSelectorLabel: node-type # Update with your node label/value
nodeSelectorValue: spr
# Persistent volume claim storage resources
storage:
storageClassName: nfs-client # Update with your cluster's storage class name
resources: 50Gi
pvcMountPath: /tmp/pvc-mount
The CPU resource limits/requests in the yaml are defined in cpu units where 1 CPU unit is equivalent to 1 physical CPU core or 1 virtual core (depending on whether the node is a physical host or a VM). The amount of CPU and memory limits/requests defined in the yaml should be less than the amount of available CPU/memory capacity on a single machine. It is usually a good idea to not use the entire machine's capacity in order to leave some resources for the kubelet and OS. In order to get "guaranteed" quality of service for the worker pods, set the same CPU and memory amounts for both the requests and limits.
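Once the job is running (Step 4), you can confirm that the workers landed in the "Guaranteed" QoS class with a quick jsonpath query; the pod name below matches the example job in this tutorial.

# Prints "Guaranteed" when the CPU/memory requests and limits match
kubectl get pod medical-meadow-pytorchjob-worker-0 --namespace kubeflow -o jsonpath='{.status.qosClass}{"\n"}'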
Step 3: Deploy the Helm chart to the cluster
Deploy the Helm chart to the cluster using the kubeflow namespace:
# Navigate to the hf_k8s directory from the extracted tar file
cd hf_k8s
# Deploy the job using the Helm chart, specifying your values file name with the -f parameter
helm install --namespace kubeflow -f chart/medical_meadow_values.yaml llama2-distributed ./chart
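As a quick sanity check, you can list the Helm release and the PyTorchJob resource it created; the release name matches the helm install command above.

# Verify the release and the PyTorchJob resource
helm list --namespace kubeflow
kubectl get pytorchjobs --namespace kubeflow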
Step 4: Monitor the job
After the Helm chart is deployed to the cluster, the K8s resources like the secret, PVC, and worker pods are created. The job can be monitored by looking at the pod status using kubectl get pods. At first, the pods will show as "Pending" while the containers get pulled and created, and then the status should change to "Running".
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
medical-meadow-dataaccess 1/1 Running 0 1h30m
medical-meadow-pytorchjob-worker-0 1/1 Running 0 1h30m
medical-meadow-pytorchjob-worker-1 1/1 Running 0 1h30m
medical-meadow-pytorchjob-worker-2 1/1 Running 0 1h30m
medical-meadow-pytorchjob-worker-3 1/1 Running 0 1h30m
Watch the training logs using kubectl logs <pod name>. You can also add -f to stream the logs.
$ kubectl logs medical-meadow-pytorchjob-worker-0
...
72%|███████▏ | 2595/3597 [4:08:05<1{'loss': 2.2737, 'learning_rate': 5.543508479288296e-05, 'epoch': 2.17}
...
Step 5: Download the trained model
After the job completes, the trained model can be copied from /tmp/pvc-mount/output/saved_model (the path defined in your values file for the train.outputDir parameter) to the local system using the following command:
kubectl cp --namespace kubeflow medical-meadow-dataaccess:/tmp/pvc-mount/output/saved_model .
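If you want to confirm the saved model files exist before copying them, you can list the output directory through the data access pod; the pod name and path match the example above.

# List the saved model files on the shared PVC
kubectl exec --namespace kubeflow medical-meadow-dataaccess -- ls /tmp/pvc-mount/output/saved_model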
Step 6: Clean up
Finally, the resources can be deleted from the cluster using the helm uninstall command with the name of the Helm release to delete. A list of all of the deployed Helm releases can be seen using helm list.
helm uninstall --namespace kubeflow llama2-distributed
After uninstalling the Helm chart, the resources on the cluster should show a status of "Terminating", and then they will eventually disappear.
Results
We fine tuned Llama2-7b using the Medical Meadow Flashcards dataset for 3 epochs on our vanilla K8s cluster using 4th generation Intel® Xeon® Scalable processors. We used the default parameters from the values.yaml file and then experimented with changing the number of workers and turning mixed precision training on and off. Measuring the accuracy of text generation models can be tricky, since there isn't a clear right or wrong answer. Instead, we used the perplexity metric to roughly gauge how confident the model is in its predictions. We reserved 20% of the data for evaluation. We saw a notable reduction in training time when increasing the number of workers. You won't see an exact 2x or 4x improvement when going from a single node to 2 or 4 nodes, since there is some overhead for communication between the nodes and some resources are reserved for the OS. We also saw a significant performance improvement when using bfloat16 training instead of FP32. In all cases, the perplexity value stayed relatively consistent, which is a good sign, because models can sometimes see an accuracy drop when scaling out a job or training with fewer bits (like bfloat16).
Next Steps
Now that we've fine tuned Llama 2 using the Medical Meadow Flashcards dataset, you're probably wondering how to use this information to run your own workload on K8s. If you're using Llama 2 or a similar generative text model, you're in luck, because the same Docker container and script can be reused. You'd just have to edit the parameters in the distributed.train section of the values.yaml file to use your dataset and tweak other parameters (learning rate, epochs, etc.) as needed. If you want to use your own fine tuning script, you will need to build a Docker container that includes the libraries needed to run the training job, along with your script. The image needs to be copied to the cluster nodes or pushed to a container registry. The container image and tag need to be updated in the values.yaml file, along with the script name and all of the python parameters for your script.
All of the scripts, Dockerfile, and spec files for the tutorial can be found in our GitHub repo.
The Intel® AI Containers for PyTorch base containers can be used to run your own distributed training job.
For other resources on distributed training, check out the Hugging Face documentation for efficient training on multiple CPUs.
Acknowledgments
Thank you to my colleagues who made contributions and helped to review this blog: Harsha Ramayanam, Omar Khleif, Abolfazl Shahbazi, Rajesh Poornachandran, Melanie Buehler, Daniel De Leon, Tyler Wilbers, and Matthew Fleetwood.
Citations
@misc{touvron2023llama,
title={Llama 2: Open Foundation and Fine-Tuned Chat Models},
author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom},
year={2023},
eprint={2307.09288},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@article{han2023medalpaca,
title={MedAlpaca--An Open-Source Collection of Medical Conversational AI Models and Training Data},
author={Han, Tianyu and Adams, Lisa C and Papaioannou, Jens-Michalis and Grundmann, Paul and Oberhauser, Tom and L{\"o}ser, Alexander and Truhn, Daniel and Bressem, Keno K},
journal={arXiv preprint arXiv:2304.08247},
year={2023}
}