---
tags:
- text2text-generation
metrics:
- bleu
- chrf
model-index:
- name: cantonese-chinese-translation-gen1
  results: []
datasets:
- raptorkwok/cantonese-chinese-dataset-gen2
language:
- zh
---

# Cantonese-Written Chinese Translation Model

This model is a fine-tuned version of [fnlp/bart-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) on the [Cantonese-Written Chinese Dataset Gen2](https://huggingface.co/raptorkwok/cantonese-chinese-dataset-gen2).
It achieves the following results on the evaluation set (these match the step 10000 row of the training results table):
- Loss: 1.5413
- Bleu: 40.7808
- Chrf: 42.5628
- Gen Len: 13.2556

## Model description

The model is based on the BART Chinese model and was fine-tuned on a Cantonese-Written Chinese parallel corpus of 1M entries.

## Intended uses & limitations

The model is intended to translate Cantonese sentences into Written Chinese.
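
The following is a minimal inference sketch, not a confirmed usage recipe. It assumes the checkpoint is published on the Hugging Face Hub and loads through the `text2text-generation` pipeline named in the tags above. `MODEL_ID` is a placeholder rather than a verified repository id, `max_length=64` is an arbitrary choice, and `strip_cjk_spaces` is a hypothetical helper: assuming the model keeps the BERT-style tokenizer of its base model, decoded Chinese text tends to come back with spaces between characters, which the helper removes.

```python
import re

# Placeholder (assumption): replace with the actual Hub repository id of this model.
MODEL_ID = "<namespace>/cantonese-chinese-translation-gen1"

# CJK punctuation, CJK unified ideographs, and full-width forms.
_CJK = r"\u3000-\u303f\u4e00-\u9fff\uff00-\uffef"


def strip_cjk_spaces(text: str) -> str:
    """Remove whitespace adjacent to CJK characters; keep spaces between Latin words."""
    text = re.sub(rf"(?<=[{_CJK}])\s+", "", text)  # whitespace after a CJK character
    text = re.sub(rf"\s+(?=[{_CJK}])", "", text)   # whitespace before a CJK character
    return text.strip()


def translate(sentences, model_id=MODEL_ID, max_length=64):
    """Translate Cantonese sentences into Written Chinese via a Hub pipeline."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import pipeline

    translator = pipeline("text2text-generation", model=model_id)
    outputs = translator(list(sentences), max_length=max_length)
    # For a list of strings, the pipeline returns one dict per input sentence.
    return [strip_cjk_spaces(out["generated_text"]) for out in outputs]
```

Once `MODEL_ID` points at the real repository, a call such as `translate(["我哋今晚去邊度食飯？"])` returns a list with one Written Chinese string per input sentence.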

## Training and evaluation data

Both the training and evaluation data come from the [Cantonese-Written Chinese Dataset Gen2](https://huggingface.co/raptorkwok/cantonese-chinese-dataset-gen2).

## Training procedure

Training was performed with the `Seq2SeqTrainer` class from Hugging Face Transformers.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
- mixed_precision_training: Native AMP
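
To illustrate what `lr_scheduler_type: linear` means for the learning rate listed above, here is a small pure-Python sketch of a linear schedule with optional warmup. The card does not state a warmup setting or the total number of optimizer steps, so `warmup_steps=0` and any `total_steps` value passed in are assumptions, and `linear_lr` is a hypothetical helper rather than the trainer's own code.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly so the rate reaches 0 at total_steps and stays there.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    # Assumption: an illustrative total of 1,000,000 optimizer steps.
    for step in (0, 250_000, 500_000, 1_000_000):
        print(step, linear_lr(step, total_steps=1_000_000))
```

Under these assumptions the rate starts at 2e-05, is halved at the midpoint of training, and reaches zero at the final step.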

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Bleu    | Chrf    | Gen Len |
|:-------------:|:-----:|:-----:|:---------------:|:-------:|:-------:|:-------:|
| 0.2275        | 0.05  | 5000  | 1.5256          | 40.6521 | 42.475  | 13.2277 |
| 0.1752        | 0.1   | 10000 | 1.5413          | 40.7808 | 42.5628 | 13.2556 |
| 0.1533        | 0.15  | 15000 | 1.5938          | 40.7698 | 42.5348 | 13.2678 |
| 0.1442        | 0.2   | 20000 | 1.6487          | 40.6062 | 42.353  | 13.2602 |
| 0.1317        | 0.24  | 25000 | 1.7148          | 40.569  | 42.2753 | 13.2798 |
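
The table can be read programmatically to see which checkpoint scores best and roughly how many steps make up one epoch. The sketch below copies the rows verbatim. The `Row` type and variable names are illustrative, and the steps-per-epoch figure is only an estimate because the Epoch column is rounded to two decimals.

```python
from collections import namedtuple

Row = namedtuple("Row", "train_loss epoch step val_loss bleu chrf gen_len")

# Values copied from the training results table.
ROWS = [
    Row(0.2275, 0.05, 5000, 1.5256, 40.6521, 42.475, 13.2277),
    Row(0.1752, 0.10, 10000, 1.5413, 40.7808, 42.5628, 13.2556),
    Row(0.1533, 0.15, 15000, 1.5938, 40.7698, 42.5348, 13.2678),
    Row(0.1442, 0.20, 20000, 1.6487, 40.6062, 42.353, 13.2602),
    Row(0.1317, 0.24, 25000, 1.7148, 40.569, 42.2753, 13.2798),
]

best_by_bleu = max(ROWS, key=lambda r: r.bleu)
best_by_chrf = max(ROWS, key=lambda r: r.chrf)
lowest_val_loss = min(ROWS, key=lambda r: r.val_loss)

# Rough estimate: average of step / epoch over all rows (epoch values are rounded).
steps_per_epoch = sum(r.step / r.epoch for r in ROWS) / len(ROWS)

print("Best BLEU at step:", best_by_bleu.step)
print("Best chrF at step:", best_by_chrf.step)
print("Lowest validation loss at step:", lowest_val_loss.step)
print("Estimated steps per epoch:", round(steps_per_epoch))
```

BLEU and chrF both peak at step 10000, which is the checkpoint whose numbers are reported at the top of this card, while validation loss is lowest at step 5000 and rises in every later row.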


### Framework versions

- Transformers 4.28.1
- PyTorch 2.3.1+cu121
- Datasets 2.19.1
- Tokenizers 0.13.3