Clarifying bos_token
#27 by taohoang
Hi,
In the definition of DataCollatorSpeechSeq2SeqWithPadding in https://huggingface.co/blog/fine-tune-whisper, I am trying to understand the following part:
# if bos token is appended in previous tokenization step,
# cut bos token here as it's appended again later anyway
if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
    labels = labels[:, 1:]
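For context, here is roughly how that fragment sits inside the collator's __call__ (a paraphrased sketch of the blog's collator, so the surrounding details are approximate):

from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # pad the audio inputs and the tokenized labels separately,
        # since they need different padding strategies
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding token ids with -100 so they are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # the part I am asking about: drop a leading bos token when every
        # sequence in the batch starts with it
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch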
Where will the bos token be appended again later during training?
After loading the tokenizer, it seems bos_token is <|endoftext|> instead of <|startoftranscript|>:
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Hindi", task="transcribe")
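For reference, here is a small check along these lines, showing what the tokenizer reports as bos_token versus what it actually prepends when encoding a label (the example string is just a placeholder):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)

# what the tokenizer declares as its bos token
print(tokenizer.bos_token, tokenizer.bos_token_id)

# what actually gets prepended when a transcript is tokenized
ids = tokenizer("placeholder transcript").input_ids
print(tokenizer.convert_ids_to_tokens(ids))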
Will this affect the bos_token_id check in the collator above?