Fine-tuning a large language model on Kaggle Notebooks (or even on your own computer) for solving real-world tasks

Community Article Published February 21, 2024


credits: DALL·E 3

Exploring, with simple words and concepts, some theory and ideas on adapting LLMs to your needs

Thanks to their in-context learning abilities, generative large language models (LLMs) are a feasible solution if you want a model to tackle your specific problem: you can provide the LLM with a few examples of the target task directly through the input prompt, even though it wasn’t explicitly trained on that task. However, this can prove unsatisfying because the LLM may fail to pick up the nuances of complex problems, and you cannot fit many examples into a prompt. Also, hosting your own model on your own premises keeps you in control of your data, since nothing has to be sent to external providers. The solution is fine-tuning your local LLM, because fine-tuning changes the behavior and increases the knowledge of an LLM of your choice.

Fine-tuning requires more high-quality data, more computation, and some effort, because you must craft prompts and code a solution. Still, it rewards you with LLMs that are less prone to hallucinate, can be hosted on your own servers or even your own computers, and are better suited to the tasks you want the model to execute at its best. In these two short articles, I will present all the theoretical basics and the tools needed to fine-tune a model for a specific problem in a Kaggle Notebook, easily accessible to everyone. The theory part owes a lot to the writings of Sebastian Raschka in his community blog posts on lightning.ai, where he systematically explored the fine-tuning methods for language models.

Since we’ll be working with a Llama 2 model taken from Kaggle Models, you must visit the model’s page on Kaggle (https://www.kaggle.com/models/metaresearch/llama-2) and follow the instructions there to request access to the model from Meta (you can use this page: https://ai.meta.com/resources/models-and-libraries/llama-downloads/).

Fine-tuning language models already has a history, both with generative models like GPT, based on decoder architectures, and with embedding-centric models like BERT, which rely on encoder architectures (the E in BERT stands for encoder). The classic recipe is to keep a larger or smaller part of the language model frozen (i.e., excluded from weight updates) and to attach a machine learning classifier (typically a logistic regression model, but it can be a support vector classifier, a random forest, or an XGBoost model) or some additional neural layers to the end of the model. The more of the original language model you leave unfrozen, the more of it, especially the embeddings, will adapt to your problem (and the better your evaluation scores will be), but this requires a lot of computing power if the model is large (and LLMs are incredibly huge in terms of weights and layers) and also a lot of data, because you need plenty of evidence to correctly update many parameters in a model. Suppose you have only a few labeled examples of your task and not many resources, which is extremely common in business applications. In that case, the right solution is to keep most of the original model frozen and update only the parameters of its terminal classification part.
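
To make the idea concrete, here is a minimal, hypothetical sketch of this “frozen backbone plus classifier” recipe, using a small BERT-style encoder and scikit-learn (neither is part of the Llama 2 pipeline described later; the model name, example sentences, and labels are placeholders):

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Freeze a small encoder and train only a classifier on top of its embeddings
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
for param in encoder.parameters():
    param.requires_grad = False  # the language model stays frozen

texts = ["profits rose sharply", "the company issued a profit warning"]  # placeholder data
labels = [1, 0]

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # use the [CLS] embedding of the last hidden state as a sentence feature
    features = encoder(**batch).last_hidden_state[:, 0, :].numpy()

clf = LogisticRegression().fit(features, labels)  # only this classifier is trained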

With large language models, however, the limitations are even stronger, because you cannot easily muster the computational power and the volume of data needed to update their layers. Fortunately, various ingenious approaches for fine-tuning LLMs have been devised in recent years, ensuring excellent modeling results with minimal parameter training. These techniques are commonly known as parameter-efficient fine-tuning methods, or PEFT. PEFT methods fall into three families: prompt modification, adapter methods, and parameter updates:

  • Prompt modification involves altering the input prompt to attain the desired results. It can be achieved by hard changes, when we directly modify the prompt by trial and error, or by soft changes, when we rely on backpropagation to figure out how to enhance the embeddings of the existing prompt by learning an additional tensor of free embeddings (a minimal sketch follows after this list). These methods intervene at the beginning of the architecture of LLMs.
  • Adapter methods involve adding, inside the architecture of the LLM, a few adaptable layers that are updated by backpropagation and modify the model’s behavior. These methods intervene in the middle of the architecture of LLMs.
  • Parameter updates may involve a specific part of the network weights or the network itself by a low-rank adaptation of the weights (https://arxiv.org/abs/2106.09685), a method that “can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by three times”.
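
As a concrete illustration of the “soft” prompt-modification idea mentioned above, here is a minimal sketch (not the method used later in this article): a small tensor of trainable embeddings is prepended to the input embeddings, and only that tensor receives gradient updates, while the LLM itself stays frozen.

import torch
import torch.nn as nn

# Hypothetical soft-prompt tuning sketch: only `soft_prompt` is trainable
embedding_dim, prompt_length = 4096, 20  # e.g., Llama 2 7B hidden size, 20 virtual tokens
soft_prompt = nn.Parameter(torch.randn(prompt_length, embedding_dim) * 0.01)

def prepend_soft_prompt(input_embeddings):
    # input_embeddings: (batch, seq_len, embedding_dim) produced by the frozen LLM
    batch_size = input_embeddings.size(0)
    expanded = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
    return torch.cat([expanded, input_embeddings], dim=1)

# The optimizer only ever sees the soft prompt; the LLM weights are never updated
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)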

In particular, parameter update by low-rank adaptation (LoRA) is a way of hacking the regular backpropagation updates by splitting the update matrix into two smaller matrices that, multiplied together, give back (an approximation of) the original update matrix. This is similar to matrix decomposition (such as SVD), where a reduction is obtained by accepting some loss of the information contained in the original matrix. In our case, when training LLMs for specific tasks, some loss of the model’s original complexity is actually permissible in exchange for the LLM gaining expertise on our task of interest.

Therefore, if the update matrix dimension for a layer is 1,024 by 1,024, which equates to 1,048,576 numeric values, a decomposition into two matrices sized 1,024 by 16 and 16 by 1,024, which multiplied can return something similar to the original matrix, will decrease the numeric values to be handled to 32,768.
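
The arithmetic is easy to verify:

# Parameter count for a full update matrix vs. a rank-16 LoRA decomposition
d, r = 1024, 16
full_update = d * d              # 1,048,576 values
lora_update = (d * r) + (r * d)  # 32,768 values
print(full_update, lora_update, full_update / lora_update)  # 1048576 32768 32.0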

This matrix decomposition is left to the backpropagation of the neural network, and the hyperparameter r allows us to designate the rank of the low-rank matrices used for adaptation. A smaller r corresponds to simpler low-rank matrices, reducing the number of parameters to adapt. Consequently, this can accelerate training and lower computational demands, but it also limits how much the adaptation can capture. Selecting the value of r in LoRA therefore involves a trade-off between model complexity, adaptation capability, and the risk of underfitting or overfitting. Conducting experiments with various r values is crucial to strike the right balance for your problem.

Moreover, after fine-tuning is finished, we can keep the low-rank matrices we used for the updates, which do not weigh much, and reuse them by multiplication against the original LLM, without any need to update the weights of the model itself directly. At this point, how can we also save memory and disk space by reducing the size of the LLM on which LoRA is applied? The answer is quantizing the original LLM, reducing its precision down to 4-bit. It is just like compressing a file: the LLM is kept compressed (i.e., quantized) and only expanded when it is necessary to compute the LoRA matrix updates. In this way, you can tune large language models on a single GPU while preserving the performance of the LLM after fine-tuning. This approach is called QLoRA, based on the work by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer (see https://arxiv.org/abs/2305.14314). It is also available as an open-source project on GitHub.

In the upcoming second part of this article, I will offer references and insights into the practical aspects of working with LLMs for fine-tuning tasks, especially in resource-constrained environments like Kaggle Notebooks. I will also demonstrate how to effortlessly put these techniques into practice with just a few commands and minimal configuration settings.

Hands-on fine-tuning for financial sentiment analysis

You can find all the code in this section at this Kaggle Notebook: Fine-tune Llama-2 for Sentiment Analysis

For this hands-on tutorial on fine-tuning a Llama 2 model on Kaggle Notebooks, we will deal with sentiment analysis of financial and economic information, showing how to handle such a task with limited and commonly available resources. Sentiment analysis of financial and economic information is highly relevant for businesses for several key reasons, ranging from market insights (gaining valuable insights into market trends, investor confidence, and consumer behavior) to risk management (identifying potential reputational risks) to investment decisions (by gauging the sentiment of stakeholders, investors, and the general public, businesses can assess the potential success of various investment opportunities).

Before getting into the technicalities of fine-tuning a large language model like Llama 2, we have to find the right dataset to demonstrate the potentialities of fine-tuning.

Particularly within finance and economic texts, annotated datasets are notably rare, with many exclusively reserved for proprietary purposes. In 2014, scholars from the Aalto University School of Business introduced a set of approximately 5,000 sentences to address the issue of insufficient training data (Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P., 2014, “Good debt or bad debt: Detecting semantic orientations in economic texts.” Journal of the Association for Information Science and Technology, 65[4], 782–796 - https://arxiv.org/abs/1307.5336). This collection aimed to establish human-annotated benchmarks, serving as a standard for evaluating alternative modeling techniques. The involved annotators (16 people with adequate background knowledge of financial markets) were instructed to assess the sentences solely from an investor's perspective, evaluating whether the news potentially holds a positive, negative, or neutral impact on the stock price.

The FinancialPhraseBank dataset is a comprehensive collection that captures the sentiments of financial news headlines from the viewpoint of a retail investor. Comprising two key columns, “Sentiment” and “News Headline,” the dataset effectively classifies sentiments as negative, neutral, or positive. This structured dataset is a valuable resource for analyzing and understanding the complex dynamics of sentiment in financial news. It has been used in various studies and research initiatives since its inception in the paper published in the Journal of the Association for Information Science and Technology in 2014.

The data is available under the CC BY-NC-SA 3.0 DEED license, and it can be found, complete with detailed descriptions and instructions, at https://huggingface.co/datasets/financial_phrasebank. There are also a couple of Kaggle Datasets mirrors. In our example, from all the available data (4,840 sentences from English-language financial news categorized by sentiment) we sample 900 examples for training and 900 for testing. The training and testing sets are balanced, with the same number of positive, neutral, and negative samples. We also use a sample of about one hundred examples, mainly the remaining positive and neutral examples (not many negative examples were left), for evaluation purposes during training (we just use evaluation for monitoring; no decision is taken based on such a sample).
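
As a rough sketch of this data preparation (the notebook’s exact code may differ; the configuration name, column names, and random seed below are assumptions based on the Hugging Face dataset card):

from datasets import load_dataset

# Load FinancialPhraseBank (the ~4,840-sentence "50% agreement" configuration)
dataset = load_dataset("financial_phrasebank", "sentences_50agree", split="train")
df = dataset.to_pandas()  # columns: "sentence", "label" (0=negative, 1=neutral, 2=positive)

# Draw 300 examples per class for training and 300 per class for testing
train_df = df.groupby("label").sample(n=300, random_state=42)
test_df = df.drop(train_df.index).groupby("label").sample(n=300, random_state=42)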

Without much ado, we simply point to the Kaggle notebook, where all the cells are commented step by step, showing how to structure the analysis:

In this article, we will illustrate instead the logical steps of fine-tuning. From a larger perspective, as in any machine learning project, you:

  1. retrieve data
  2. arrange data for training, validation, and testing
  3. instantiate your model
  4. evaluate your model as it is
  5. fine-tune (train) your model
  6. evaluate your model

When dealing with LLMs, however, it also makes sense to first evaluate the model as it is, guided only by prompt engineering, in order to establish a benchmark that gives meaning to your work (if your LLM is already skillful enough at the desired task, you actually do not need to perform any further fine-tuning).
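
For instance, a simple instruction-style prompt for our sentiment task could look like the following sketch (a hypothetical template, not necessarily the exact one used in the notebook): during training the gold label is appended to the prompt, while at inference time the model is asked to complete it.

def build_prompt(sentence, label=None):
    # Hypothetical prompt template for the sentiment task
    prompt = (
        "Analyze the sentiment of the news headline enclosed in square brackets, "
        "and determine if it is positive, neutral, or negative.\n"
        f"[{sentence}] = "
    )
    return prompt + (label if label is not None else "")

print(build_prompt("Operating profit rose to EUR 13.1 mn from EUR 8.7 mn."))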

Let’s now delve into the practicalities of instantiating and fine-tuning your model.

First of all, the packages used are:

  • PyTorch 2.1.2 (previously 2.0.0)
  • transformers 4.36.2 (previously 4.31)
  • datasets 2.16.1
  • accelerate 0.26.1 (previously 0.23.0)
  • bitsandbytes 0.42.0 (previously 0.41.1)

As for trl, I picked a commit from GitHub published on January 22, 2024, and for peft, I retrieved another commit published on the same date (so both packages are as fresh as possible).
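
In a Kaggle Notebook, installing these pinned versions could look like the following cell (a sketch: PyTorch 2.1.2 usually ships with the Kaggle image, and the exact GitHub commit hashes for trl and peft are omitted here):

!pip install -q transformers==4.36.2 datasets==2.16.1 accelerate==0.26.1 bitsandbytes==0.42.0
# trl and peft were taken from GitHub commits of January 22, 2024 (hashes not shown)
!pip install -q git+https://github.com/huggingface/trl.git git+https://github.com/huggingface/peft.git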

Then, you need to define what LLM you are going to tune.

model_name = "../input/llama-2/pytorch/7b-hf/1"

Our choice fell on Llama 2 7b-hf, the 7B pre-trained model from Meta, converted to the Hugging Face Transformers format. Llama 2 constitutes a series of pre-trained and optimized generative text models, varying in size from 7 billion to 70 billion parameters. Employing an enhanced transformer architecture, Llama 2 operates as an auto-regressive language model. Its fine-tuned iterations involve both supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF), ensuring conformity with human standards for helpfulness and safety. Apart from being an already well-performing LLM, the choice of this model rests on the fact that it is the nimblest of the Llama family and, thus, the most suitable to demonstrate how even the smaller LLMs are good choices for fine-tuning on specialist tasks.

Our next step is defining the BitsAndBytes configuration.

import torch
from transformers import BitsAndBytesConfig

compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

Bitsandbytes is a Python package developed by Tim Dettmers, which acts as a lightweight wrapper around CUDA custom functions, particularly 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions. It allows running models stored in 4-bit precision: while 4-bit bitsandbytes stores weights in 4 bits, the computation still happens in 16 or 32 bits, and any combination can be chosen here (float16, bfloat16, float32, and so on). The idea behind bitsandbytes has been formalized in the paper by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer (see https://arxiv.org/abs/2305.14314).

You can actually think of it as a compressor of the LLM that allows us to store it safely both on the disk and in the memory of a standard computer or server: the neural network is stored at 4-bit precision (normalized float 4, NF4, which performs best), potentially saving a lot compared with the typical 32-bit precision. Additionally, to increase the compression, one can opt for bnb_4bit_use_double_quant (but we don’t in our example), which applies a secondary quantization after the first one, resulting in a supplementary reduction of 0.4 bits per parameter. However, when computing on the network, computations are executed at the bnb_4bit_compute_dtype defined by us, which is 16-bit precision, a numeric precision that allows both fast and reasonably exact computations. This decompression phase may take more time, depending on the reductions previously obtained.
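
A back-of-the-envelope calculation shows why this matters for a 7B-parameter model (ignoring quantization constants, activations, and other overheads):

# Approximate weight storage for a 7-billion-parameter model at different precisions
params = 7e9
print(f"float32: {params * 4 / 1e9:.1f} GB")    # ~28 GB
print(f"float16: {params * 2 / 1e9:.1f} GB")    # ~14 GB
print(f"4-bit  : {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB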

Once the Bitsandbytes configuration is initialized, the next step is to load our model using the Hugging Face (HF) AutoModelForCausalLM class and its tokenizer:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config, 
)
model.config.use_cache = False
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(model_name, 
                                          trust_remote_code=True,
                                         )
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Here, apart from the quantization_config (the Bitsandbytes compression) and device_map set to “auto” so that you can leverage whatever you have on your system (CPU or GPUs), we have to notice, as specifics for this model, the pretraining_tp parameter necessarily set to one (a value stated by the HF documentation as necessary to ensure exact reproducibility of the pretraining results) and use_cache set to False (it controls whether the model should return the last key/value attentions, which is not necessary here). On the tokenizer side, the pad token is set equal to the eos token (the end-of-sequence token used to indicate the end of a sequence of tokens), and the padding side is set to the right, commonly indicated as the correct side to use when working with Llama models.

After instantiating the model, we have to prepare the training phase, which requires implementing a LoRA strategy based on a reduced number of parameters to update to adapt the original LLM to our task (see the previous article for more details).

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)

The LoRA config specifies the parameters for PEFT. The parameters we use are explained below:

  • r: The rank of the LoRA update matrices. This reduction coefficient represents a trade-off: the lower it is, the less memory is consumed, but at the cost of a coarser approximation of the updates.
  • lora_alpha: The scaling factor applied to the LoRA update matrices (the updates are scaled by lora_alpha / r). As a rule of thumb, it is often set to double the r value.
  • lora_dropout: The dropout probability for the LoRA update matrices.
  • bias: How biases are handled. The possible values in peft are none, all, and lora_only. We go for none, which excludes biases from the LoRA adaptation and keeps the number of trainable parameters smaller.
  • task_type: The type of task the model is being trained for. Possible values include CAUSAL_LM and SEQ_2_SEQ_LM, among others. Many say it doesn’t make a difference here, but CAUSAL_LM is the right choice for a decoder-only model like Llama 2.

Finally, we have to explain the target_modules="all-linear" parameter added to LoraConfig. The LoraConfig object accepts a target_modules parameter expressed as a list, an array, or a string. In some examples you find online, the target modules are [“query_key_value”]; elsewhere they are something else (in our case, all the linear layers, expressed by the “all-linear” string value), but they always refer to parts of the Transformer architecture. The choice of which layers to fine-tune actually depends on what you want to achieve (and what works better for your problem). As stated in the LoRA paper (https://arxiv.org/abs/2106.09685), Hu, Edward J., et al. “LoRA: Low-rank adaptation of large language models.” arXiv preprint arXiv:2106.09685 (2021), “_we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters_” and “_we limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules … both for simplicity and parameter-efficiency_”. The paper also states that “_we leave the empirical investigation of adapting the MLP layers, LayerNorm layers, and biases to a future work_”, implying that you can actually fine-tune whatever layers you want based on the results you obtain and your “parameter budget” (the more layers you fine-tune, the more computation and memory are required). This is stated even more clearly in section 7.1 of the paper, “_WHICH WEIGHT MATRICES IN TRANSFORMER SHOULD WE APPLY LORA TO?_”, where the authors’ choices are justified by their parameter “budget”; you are not limited to just those, and you should look for the best overall performance given your architecture and problem.

The default LoRA settings in peft adhere to the original LoRA paper, incorporating trainable weights into each attention block's query and value layers. This is what I did in the first implementation of the fine-tuning. However, in the QLoRA paper (https://huggingface.co/papers/2305.14314), research revealed that introducing trainable weights to all the linear layers of a transformer model enhances performance to match that of full fine-tuning. Given that the selection of modules may differ based on the architecture, and you would otherwise have to search manually in the architecture of your chosen model for such linear layers, peft has introduced a user-friendly shorthand: simply specify target_modules="all-linear" and let the peft package take care of the rest.
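
If you prefer to inspect the architecture yourself rather than rely on the shorthand, a small sketch like the following lists the names of the linear submodules of the model loaded above (the 4-bit bitsandbytes layers subclass nn.Linear, so they should be caught as well):

import torch.nn as nn

# Collect the (deduplicated) names of linear submodules as candidates for target_modules
linear_module_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
}
print(linear_module_names)
# for Llama-style models this typically includes names such as
# q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, lm_head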

After defining LoRA settings, separately, we have to go for the training parameters:

from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    evaluation_strategy="epoch"
)

The training_arguments object specifies the parameters for training the model. The following are some of the most important parameters:

  • output_dir: The directory where the training logs and checkpoints will be saved.
  • num_train_epochs: The number of epochs to train the model for.
  • per_device_train_batch_size: The number of samples in each batch on each device.
  • gradient_accumulation_steps: The number of batches accumulating gradients before updating the model parameters.
  • optim: The optimizer to use for training the model. Our choice is the paged_adamw_32bit optimizer, a bitsandbytes variant of AdamW that keeps the optimizer states in 32-bit precision and allocates them in paged memory, so that optimizer states can be moved between GPU and CPU as needed; this helps avoid out-of-memory spikes during training.
  • save_steps: The number of steps after which to save a checkpoint.
  • logging_steps: The number of steps after which to log the training metrics.
  • learning_rate: The learning rate for the optimizer.
  • weight_decay: The weight decay parameter for the optimizer.
  • fp16: Whether to use 16-bit floating-point precision. Training on GPU with fp16 set to True, as we do, can roughly halve memory usage and noticeably speed up training (and therefore reduce its cost). However, it can also make the training process less numerically stable and slightly reduce the accuracy of the trained model.
  • bf16: Whether to use BFloat16 precision (not for our GPU).
  • max_grad_norm: The maximum gradient norm. The maximum gradient norm is a hyperparameter used to control the magnitude of the gradient updates during training. It is relevant in training because it can help to prevent the model from becoming unstable and overfitting to the training data by taking too strong updates.
  • max_steps: The maximum number of steps to train the model for.
  • warmup_ratio: The proportion of the training steps to use for warming up the learning rate, i.e., the proportion of the training steps to gradually increase the learning rate from 0 to its final value. It is relevant in training because the warm-up can help improve the model's stability and performance.
  • group_by_length: Whether to group the training samples by length to minimize padding applied and be more efficient.
  • lr_scheduler_type: The type of learning rate scheduler to use. Our choice is the cosine scheduler, which gradually increases the learning rate at the beginning of training, thus helping the model learn the basic features of the data quickly. Then, it gradually decreases the learning rate towards the end of the training, which helps the model converge to a better solution.
  • report_to: The tools to report the training metrics to. Our choice is to use TensorBoard.
  • evaluation_strategy: The strategy for evaluating the model during training. By deciding on “epoch”, we have an evaluation of every epoch on the eval dataset, which can help us figure out if training and eval measures are diverging or not.

Finally, we can define the training itself, which is entrusted to the SFTTrainer from the trl package. trl is a library by Hugging Face providing a set of tools to train transformer language models with reinforcement learning and other methods, from the supervised fine-tuning step (SFT) and reward modeling step (RM) to the proximal policy optimization step (PPO).

The trl library now also simplifies the process of setting up a model and tokenizer for conversational AI tasks with the help of the setup_chat_format() function. This function performs the following tasks:

  1. Introduces special tokens to the tokenizer, such as <|im_start|> and <|im_end|>, signifying the beginning and end of a conversation turn.
  2. Adjusts the model’s embedding layer to accommodate these newly added tokens.
  3. Defines the chat template of the tokenizer, responsible for formatting input data into a conversation-like structure. The default template is chatml, which was inspired by OpenAI.
  4. Additionally, users have the option to specify the resize_to_multiple_of parameter, enabling them to resize the embedding layer to a multiple of the provided argument (e.g., 64).

Here is an example of how to use this function:

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
# Set up the chat format with default 'chatml' format
model, tokenizer = setup_chat_format(model, tokenizer)

Adding special tokens to a language model during fine-tuning is crucial, especially when training chat models. These tokens are pivotal in delineating the various roles within a conversation, such as the user, assistant, and system. By inserting these tokens strategically, the model gains an understanding of the structural components and the sequential flow inherent in a conversation.

In other words, the setup provided by setup_chat_format assists the model in recognizing the nuances of conversational dynamics. The model becomes attuned to transitions between different speakers and comprehends the contextual cues associated with each role. This enhanced awareness is essential for the model to generate coherent, contextually appropriate responses within a chat environment.

As for the training itself, the trl package provides the SFTTrainer, a class for supervised fine-tuning (SFT for short). SFT is a technique commonly used in machine learning, particularly in the context of deep learning, to adapt a pre-trained model to a specific task or dataset.

Here's how it typically works:

  • Pre-training: Initially, a neural network model is trained on a large dataset for a general task, such as image classification on a dataset like ImageNet. During this pre-training phase, the model learns to recognize high-level features and patterns from the data. In our case, we are leveraging an LLM such as Llama 2.

  • Fine-tuning: After pre-training, the model can be further trained or fine-tuned on a smaller, task-specific dataset. This fine-tuning process involves updating the parameters of the pre-trained model using the new dataset. However, instead of starting the training from scratch, the model starts with the weights learned during pre-training. This allows the model to quickly adapt to the new task or dataset by adjusting its parameters to better fit the new data.

  • Supervision: The fine-tuning process is supervised, meaning that the model is provided with labeled examples (input-output pairs) from the task-specific dataset. This supervision guides the learning process and helps the model improve its performance on the specific task.

Supervised fine-tuning is particularly useful when you have a small dataset available for your target task, as it leverages the knowledge encoded in the pre-trained model while still adapting to the specifics of the new task. This approach often leads to faster convergence and better performance compared to training a model from scratch, especially when the pre-trained model has been trained on a large and diverse dataset.

Here is our setting up of the SFTTrainer:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
    max_seq_length=1024,
)

The SFTTrainer object is initialized with the following arguments:

  • model: The model to be trained.
  • train_dataset: The training dataset.
  • eval_dataset: The evaluation dataset.
  • peft_config: The PEFT configuration.
  • dataset_text_field: The name of the text field in the dataset (we used the HuggingFace Dataset implementation).
  • tokenizer: The tokenizer to use.
  • args: The training arguments we previously set.
  • packing: Whether to pack the training samples.
  • max_seq_length: The maximum sequence length.

This basically completes our fine-tuning work because all that is left to do is the training itself and then save the updated model to disk:

trainer.train()
trainer.model.save_pretrained("trained-model")

However, we cannot say everything is completed if we cannot re-use or share our fine-tuned model. How do you save your fine-tuned model and publish or re-use it?

A few more commands will do the magic, though they require quite a lot of free CPU and GPU memory; that means, if we keep operating in the same Kaggle notebook, we need to do some cleaning.

Things start after we have saved our fine-tuned QLoRA weights to disk:

trainer.save_model()
tokenizer.save_pretrained(output_dir)

The point here is that we are just saving the QLoRA weights, which are a modifier (by matrix multiplication) of our original model (in our example, a Llama 2 7B). In fact, when working with QLoRA, we exclusively train adapters instead of the entire model. So, when you save the model during training, you only preserve the adapter weights, not the entire model.

If you want to save the entire model for easier use with Text Generation Inference, you can merge the adapter weights into the model weights using the merge_and_unload method. Then, you can save the model using the save_pretrained method. This will create a default model that’s ready for inference tasks. A simple command can achieve merging, but first, we have to clean up our memory:

import gc
del [model, tokenizer, peft_config, trainer, train_data, eval_data, bnb_config, training_arguments]
del [df, X_train, X_eval]
del [TrainingArguments, SFTTrainer, LoraConfig, BitsAndBytesConfig]
for _ in range(100):
    torch.cuda.empty_cache()
    gc.collect()

After deleting the models and data we won’t use anymore, we garbage collect the memory with gc.collect() and clean the GPU memory cache by torch.cuda.empty_cache().

Then, we can proceed to merge the weights and use the merged model for our testing purposes.

from peft import AutoPeftModelForCausalLM
finetuned_model = "./trained_weigths/"
compute_dtype = getattr(torch, "float16")
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/llama-2/pytorch/7b-hf/1")
model = AutoPeftModelForCausalLM.from_pretrained(
     finetuned_model,
     torch_dtype=compute_dtype,
     return_dict=False,
     low_cpu_mem_usage=True,
     device_map=device,
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model",safe_serialization=True, max_shard_size="2GB")
tokenizer.save_pretrained("./merged_model")

The above code snippet performs several tasks to merge the QLoRA weights with the original model and reload the associated tokenizer. Firstly, the relevant modules are imported, including AutoPeftModelForCausalLM from the peft package, while relying on already imported components such as torch and AutoTokenizer from the transformers library.

Paths and configurations are then defined, such as the directory containing the fine-tuned model weights (finetuned_model), and the data type for computations is set to float16 (compute_dtype). The tokenizer is loaded from the Llama 2 model location. Subsequently, the model is loaded using the specified configurations, including optimizations for memory usage. After loading, the model undergoes a merging and unloading process to consolidate the QLoRA weights and the original weights. This operation takes time and quite a lot of memory. If errors happen here, it is because you don’t have enough available memory (typically: NotImplementedError: Cannot copy out of meta tensor; no data!). Just recheck your memory situation (both CPU and GPU; an nvidia-smi command may help), clean up better, collect memory garbage, and retry.

The merged model is finally saved to a designated directory, ensuring safe serialization and limiting shard size to 2GB. Furthermore, the tokenizer is saved alongside the merged model, facilitating future use.

That’s all. The model is now stored in a new directory, ready to be loaded and used for any task you need. As a last step, we just need to test the model on our test set.
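
The evaluation loop itself can be sketched as follows (a hypothetical version: it reuses the build_prompt template and test_df split sketched earlier, and the mapping from generated words back to label ids is an assumption; the notebook's actual code may differ):

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def predict(sentence):
    # Generate a short completion and map it back to a label id (assumed mapping)
    inputs = tokenizer(build_prompt(sentence), return_tensors="pt").to(merged_model.device)
    output = merged_model.generate(**inputs, max_new_tokens=3, do_sample=False)
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True).lower()
    if "positive" in completion:
        return 2
    if "negative" in completion:
        return 0
    return 1  # default to neutral

y_true = test_df["label"].tolist()
y_pred = [predict(s) for s in test_df["sentence"]]
print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))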

A classification report highlights:

 Accuracy: 0.851
 Accuracy for label 0: 0.913
 Accuracy for label 1: 0.863
 Accuracy for label 2: 0.777

 Classification Report:
               precision    recall  f1-score   support

            0       0.95      0.91      0.93       300
            1       0.74      0.86      0.80       300
            2       0.88      0.78      0.82       300

     accuracy                           0.85       900
    macro avg       0.86      0.85      0.85       900
 weighted avg       0.86      0.85      0.85       900


 Confusion Matrix:
 [[274  24   2]
  [ 11 259  30]
  [  2  65 233]]

This is definitely a strong improvement over a simpler baseline on the very same problem (using exactly the same training and testing data), which returns an overall accuracy of 0.623 (see: LSTM Baseline for Sentiment Analysis).

Reprising hands-on fine-tuning for financial sentiment analysis with Mistral 7B Instruct v0.2 and Phi-2

After fine-tuning Llama 2 7B on a dataset for financial sentiment analysis on consumer-grade, easily accessible, and free GPUs (https://www.kaggle.com/code/lucamassaron/fine-tune-llama-2-for-sentiment-analysis/), you can re-use the very same code to fine-tune more recently released large language models such as Mistral 7B Instruct v0.2 and Phi-2.

In this article, I will present the exciting characteristics of these new large language models and how to modify the starting LLama fine-tuning to adapt to each of them.

The new models

Mistral 7B Instruct v0.2 builds upon the foundation of its predecessor, Mistral 7B Instruct v0.1, introducing refined instruction-finetuning techniques that elevate its capabilities. Everything starts from Mistral 7B, developed by Mistral AI, a Paris-based AI startup founded by former Google DeepMind and Meta employees, which aims to compete with OpenAI in constructing, training, and applying large language models and generative AI.

With its 7.3B parameters, Mistral 7B stands out among its counterparts, consistently surpassing Llama 2 13B on all benchmarks and matching Llama 1 34B performance on numerous tasks. It even rivals CodeLlama 7B’s proficiency in code-related areas while maintaining its excellence in English-based tasks (and it handles European languages remarkably well).

To achieve this remarkable level of performance, Mistral 7B employs two innovative techniques: Grouped-query attention (GQA) for accelerated inference and Sliding Window Attention (SWA) for efficiently handling lengthy sequences at a lower cost.

GQA streamlines the inference process by grouping and processing relevant query terms in parallel, reducing computational time and enhancing overall speed. SWA, on the other hand, tackles the challenge of operating on lengthy sequences by dividing them into smaller windows and applying attention mechanisms to each window independently, resulting in more efficient processing and reduced memory consumption.

The Mistral 7B Instruct model is designed to be fine-tuned for specific tasks, such as instruction following, creative text generation, and question answering, thus proving how flexible Mistral 7B is to be fine-tuned. As a caveat, it has no built-in moderation mechanism to filter out inappropriate or harmful content.

Phi-2 is instead a small language model developed by Microsoft Research. It has only 2.7 billion parameters, significantly fewer than other LLMs. Its training has been based on a corpus similar to that of Phi-1 and Phi-1.5, focusing on “textbook-quality” data, including subsets of Python code from The Stack v1.2, Q&A content from StackOverflow, competition code from code_contests, and synthetic Python textbooks and exercises generated by gpt-3.5-turbo-0301. Phi-2 has also not undergone fine-tuning through reinforcement learning from human feedback, hence there is no filtering of any kind.

Code adjustments

Setting Mistral 7B Instruct to work is a breeze (no pun intended 😄). All you must consider is that, to utilize its instruction fine-tuning, you need to enclose your prompt between [INST] and [/INST] markers (a small sketch follows the results below). That’s all! After the fine-tuning process, the results show top performance for all classes in terms of detected sentiment on our test set:

Accuracy: 0.868
Accuracy for label negative: 0.977
Accuracy for label neutral: 0.743
Accuracy for label positive: 0.883
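
For reference, wrapping the prompt in the instruction markers could be as simple as this sketch (build_prompt is the hypothetical template from the Llama 2 part):

def build_mistral_prompt(sentence):
    # Wrap the same sentiment prompt in Mistral's instruction markers
    return f"[INST] {build_prompt(sentence)} [/INST]"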

Phi-2 requires more work because it has less stringent requirements for instructions and displays a very peculiar behavior. It tends to treat the question as a quiz and to return unrequested elements from the texts it used for its original learning. Therefore, after evaluating the sentiment of a text, it eruditely starts a discussion about the Mughal empire. The most efficient way to obtain answers from the network is to limit the response to at least 3 tokens, to allow extra spaces and answer letters to appear before the prediction (something that can’t be avoided), and to structure the prompt as:

The sentiment of the following phrase: ‘…’

Solution: The correct option is ...

Another essential fact about Phi-2 is that you need to declare the target modules you want to adapt when setting the parameters for LoRA (Low-Rank Adaptation), the parameter-reduction technique used to compress the update matrices in a transformer model. Here, we found it necessary to specify “Wqkv” and “out_proj” explicitly. “Wqkv” and “out_proj” are modules in Phi-2’s Transformer architecture used for attention and output projection.

Wqkv is the fused linear projection that generates the attention mechanism's query, key, and value vectors. These vectors are then used to compute the attention scores, which determine the relevance of each word in the input sequence to each word in the output sequence.

out_proj is the linear layer that projects the attention output back into the model’s hidden space, recombining the information from the attention heads before it is passed on to the rest of the block.

In the context of the Phi-2 model, these are the modules we target with LoRA for the fine-tuning. By adapting them, the model can learn to better understand and respond to our instructions.
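
Translated into code, the only change with respect to the earlier configuration is the explicit list of target modules (the other hyperparameters below simply mirror the Llama 2 setup and are assumptions, not values specific to Phi-2):

from peft import LoraConfig

phi2_peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["Wqkv", "out_proj"],  # Phi-2's fused QKV projection and attention output projection
)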

By doing so, the results are somewhat lower than with Mistral Instruct but better than with Llama 2, and they come from a much smaller model:

Accuracy: 0.856
Accuracy for label negative: 0.973
Accuracy for label neutral: 0.743
Accuracy for label positive: 0.850

Conclusions

This completes our tour of the steps for fine-tuning an LLM such as Meta’s Llama 2 (and Mistral and Phi-2) in Kaggle Notebooks (it can work on consumer hardware, too). As with many machine learning problems, after grasping the technicalities for running the training, everything boils down to a good understanding of the problem, proper data preparation, and some experimentation to adapt your tools to the problem (and vice versa if necessary).

Code references: