Quick tour
Let’s have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG), such as completing a prompt with new text or translating in another language.
First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we will dig a little bit more and see how the library gives you access to those models and helps you preprocess your data.
All code examples presented in the documentation have a switch on the top left for Pytorch versus TensorFlow. If not, the code is expected to work for both backends without any change needed.
Getting started on a task with a pipeline
The easiest way to use a pretrained model on a given task is to use pipeline().
🤗 Transformers provides the following tasks out of the box:
- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
- Name entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.)
- Question answering: provide the model with some context and a question, extract the answer from the context.
- Filling masked text: given a text with masked words (e.g., replaced by
[MASK]
), fill the blanks. - Summarization: generate a summary of a long text.
- Translation: translate a text in another language.
- Feature extraction: return a tensor representation of the text.
Let’s see how this work for sentiment analysis (the other tasks are all covered in the task summary):
Install the following dependencies (if not already installed):
pip install torch
>>> from transformers import pipeline
>>> classifier = pipeline("sentiment-analysis")
When typing this command for the first time, a pretrained model and its tokenizer are downloaded and cached. We will look at both later on, but as an introduction the tokenizer’s job is to preprocess the text for the model, which is then responsible for making predictions. The pipeline groups all of that together, and post-process the predictions to make them readable. For instance:
>>> classifier("We are very happy to show you the 🤗 Transformers library.")
[{'label': 'POSITIVE', 'score': 0.9998}]
That’s encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model, returning a list of dictionaries like this one:
>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
>>> for result in results:
... print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
To use with a large dataset, look at iterating over a pipeline
You can see the second sentence has been classified as negative (it needs to be positive or negative) but its score is fairly neutral.
By default, the model downloaded for this pipeline is called “distilbert-base-uncased-finetuned-sst-2-english”. We can look at its model page to get more information about it. It uses the DistilBERT architecture and has been fine-tuned on a dataset called SST-2 for the sentiment analysis task.
Let’s say we want to use another model; for instance, one that has been trained on French data. We can search through the model hub that gathers models pretrained on a lot of data by research labs, but also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags “French” and “text-classification” gives back a suggestion “nlptown/bert-base-multilingual-uncased-sentiment”. Let’s see how we can use it.
You can directly pass the name of the model to use to pipeline():
>>> classifier = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")
This classifier can now deal with texts in English, French, but also Dutch, German, Italian and Spanish! You can also replace that name by a local folder where you have saved a pretrained model (see below). You can also pass a model object and its associated tokenizer.
We will need two classes for this. The first is AutoTokenizer, which we will use to download the tokenizer associated to the model we picked and instantiate it. The second is AutoModelForSequenceClassification (or TFAutoModelForSequenceClassification if you are using TensorFlow), which we will use to download the model itself. Note that if we were using the library on an other task, the class of the model would change. The task summary tutorial summarizes which class is used for which task.
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
Now, to download the models and tokenizer we found previously, we just have to use the
from_pretrained()
method (feel free to replace model_name
by
any other model from the model hub):
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
If you don’t find a model that has been pretrained on some data similar to yours, you will need to fine-tune a pretrained model on your data. We provide example scripts to do so. Once you’re done, don’t forget to share your fine-tuned model on the hub with the community, using this tutorial.
Under the hood: pretrained models
Let’s now see what happens beneath the hood when using those pipelines.
As we saw, the model and tokenizer are created using the from_pretrained
method:
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
Using the tokenizer
We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in words (or part of words, punctuation symbols, etc.) usually called tokens. There are multiple rules that can govern that process (you can learn more about them in the tokenizer summary), which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was pretrained.
The second step is to convert those tokens into numbers, to be able to build a tensor out of them and feed them to
the model. To do this, the tokenizer has a vocab, which is the part we download when we instantiate it with the
from_pretrained
method, since we need to use the same vocab as when the model was pretrained.
To apply these steps on a given text, we can just feed it to our tokenizer:
>>> inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
This returns a dictionary string to list of ints. It contains the ids of the tokens, as mentioned before, but also additional arguments that will be useful to the model. Here for instance, we also have an attention mask that the model will use to have a better understanding of the sequence:
>>> print(inputs)
{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
You can pass a list of sentences directly to your tokenizer. If your goal is to send them through your model as a batch, you probably want to pad them all to the same length, truncate them to the maximum length the model can accept and get tensors back. You can specify all of that to the tokenizer:
>>> pt_batch = tokenizer(
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
... padding=True,
... truncation=True,
... max_length=512,
... return_tensors="pt",
... )
The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding token the model was pretrained with. The attention mask is also adapted to take the padding into account:
>>> for key, value in pt_batch.items():
... print(f"{key}: {value.numpy().tolist()}")
input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
You can learn more about tokenizers here.
Using the model
Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
contain all the relevant information the model needs. If you’re using a TensorFlow model, you can pass the dictionary
keys directly to tensors, for a PyTorch model, you need to unpack the dictionary by adding **
.
>>> pt_outputs = pt_model(**pt_batch)
In 🤗 Transformers, all outputs are objects that contain the model’s final activations along with other metadata. These objects are described in greater detail here. For now, let’s inspect the output ourselves:
>>> print(pt_outputs)
SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833, 4.3364],
[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
Notice how the output object has a logits
attribute. You can use this to access the model’s final activations.
All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model before the final activation function (like SoftMax) since this final activation function is often fused with the loss.
Let’s apply the SoftMax activation to get predictions.
>>> from torch import nn
>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
We can see we get the numbers from before:
>>> print(pt_predictions)
tensor([[2.2043e-04, 9.9978e-01],
[5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
If you provide the model with labels in addition to inputs, the model output object will also contain a loss
attribute:
>>> import torch
>>> pt_outputs = pt_model(**pt_batch, labels=torch.tensor([1, 0]))
>>> print(pt_outputs)
SequenceClassifierOutput(loss=tensor(0.3167, grad_fn=<NllLossBackward>), logits=tensor([[-4.0833, 4.3364],
[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
Models are standard torch.nn.Module or tf.keras.Model so you can use them in your usual training loop. 🤗 Transformers also provides a Trainer class to help with your training in PyTorch (taking care of things such as distributed training, mixed precision, etc.) whereas you can leverage the fit()
method in Keras. See the training tutorial for more details.
Pytorch model outputs are special dataclasses so that you can get autocompletion for their attributes in an IDE.
They also behave like a tuple or a dictionary (e.g., you can index with an integer, a slice or a string) in which
case the attributes not set (that have None
values) are ignored.
Once your model is fine-tuned, you can save it with its tokenizer in the following way:
>>> pt_save_directory = "./pt_save_pretrained"
>>> tokenizer.save_pretrained(pt_save_directory)
>>> pt_model.save_pretrained(pt_save_directory)
You can then load this model back using the AutoModel.from_pretrained()
method by passing the
directory name instead of the model name. One cool feature of 🤗 Transformers is that you can easily switch between
PyTorch and TensorFlow: any model saved as before can be loaded back either in PyTorch or TensorFlow.
If you would like to load your saved model in the other framework, first make sure it is installed:
pip install torch
Then, use the corresponding Auto class to load it like this:
>>> from transformers import AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
>>> pt_model = AutoModel.from_pretrained(tf_save_directory, from_tf=True)
Lastly, you can also ask the model to return all hidden states and all attention weights if you need them:
>>> pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
>>> all_hidden_states = pt_outputs.hidden_states
>>> all_attentions = pt_outputs.attentions
Accessing the code
The AutoModel and AutoTokenizer classes are just shortcuts that will automatically work with any pretrained model. Behind the scenes, the library has one model class per combination of architecture plus class, so the code is easy to access and tweak if you need to.
In our previous example, the model was called “distilbert-base-uncased-finetuned-sst-2-english”, which means it’s using the DistilBERT architecture. As AutoModelForSequenceClassification (or TFAutoModelForSequenceClassification if you are using TensorFlow) was used, the model automatically created is then a DistilBertForSequenceClassification. You can look at its documentation for all details relevant to that specific model, or browse the source code. This is how you would directly instantiate model and tokenizer without the auto magic:
>>> from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
>>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = DistilBertForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
Customizing the model
If you want to change how the model itself is built, you can define a custom configuration class. Each architecture comes with its own relevant configuration. For example, DistilBertConfig allows you to specify parameters such as the hidden dimension, dropout rate, etc for DistilBERT. If you do core modifications, like changing the hidden size, you won’t be able to use a pretrained model anymore and will need to train from scratch. You would then instantiate the model directly from this configuration.
Below, we load a predefined vocabulary for a tokenizer with the from_pretrained() method. However, unlike the tokenizer, we wish to initialize the model from scratch. Therefore, we instantiate the model from a configuration instead of using the DistilBertForSequenceClassification.from_pretrained() method.
>>> from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
>>> config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4 * 512)
>>> tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
>>> model = DistilBertForSequenceClassification(config)
For something that only changes the head of the model (for instance, the number of labels), you can still use a
pretrained model for the body. For instance, let’s define a classifier for 10 different labels using a pretrained body.
Instead of creating a new configuration with all the default values just to change the number of labels, we can instead
pass any argument a configuration would take to the from_pretrained
method and it will update the default
configuration appropriately:
>>> from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
>>> model_name = "distilbert-base-uncased"
>>> model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
>>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)