
Fine-tuned CodeT5 (fine-tuned from the CodeT5-base model)

This is a fine-tuned CodeT5 model. The code used for fine-tuning is released in this repository.

The pre-trained model was introduced in the paper CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi, and was first released in this repository.

Model description

This model is fine-tuned from CodeT5-base for the downstream task of Python code generation (natural language to Python code). CodeT5 is a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed by developer-assigned identifiers. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and on generation tasks across various directions, including PL-NL, NL-PL, and PL-PL.

How to use

Here is how to use this model:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('gangqinxiao13/fine-tuned-codet5')
model = T5ForConditionalGeneration.from_pretrained('gangqinxiao13/fine-tuned-codet5')

text = "print hello"
task_prefix = "Generate Python code from natural language:"
# Tokenize the prefixed prompt, padding/truncating to the model's 512-token limit
inputs = tokenizer(
    [task_prefix + text],
    return_tensors="pt",
    padding='max_length',
    max_length=512,
    truncation=True
)

# Generate Python code with greedy decoding
outputs = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    do_sample=False,
    max_new_tokens=512
)
generated_code = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(generated_code[0])
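
The example above uses greedy decoding (do_sample=False). As an optional variation, not part of the original recipe, beam search can be requested through the standard num_beams argument of model.generate; the sketch below reuses the tokenizer, model, and inputs prepared above:

# Optional: beam-search decoding (an assumed alternative, not the documented default)
outputs = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    num_beams=5,           # explore 5 hypotheses instead of a single greedy path
    early_stopping=True,   # stop once all beams have finished
    max_new_tokens=512
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])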

Pre-training data

The CodeT5 model was pre-trained on CodeSearchNet (Husain et al., 2019). Additionally, the authors collected two datasets of C/C# from BigQuery so that all downstream tasks have programming languages that overlap with the pre-training data. In total, around 8.35 million instances were used for pre-training.

Fine-tuning data

The fine-tuned model was trained on conala-mined (Yin et al., 2019), Re-sampled Python3.7 API Knowledge (Xu et al., 2020), and MBPP (Mostly Basic Python Programming, Austin et al., 2021).
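
For reference, seq2seq fine-tuning of a T5-style model pairs a natural-language prompt with a target code snippet. The sketch below shows one plausible way such pairs could be tokenized, reusing the task prefix from the usage example; the example pair and the exact prompt format are assumptions for illustration, not the confirmed preprocessing of this checkpoint:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('gangqinxiao13/fine-tuned-codet5')
task_prefix = "Generate Python code from natural language:"

# Hypothetical (intent, snippet) pair in the style of conala-mined/MBPP
example = {"intent": "check if a key exists in a dictionary d", "snippet": "key in d"}

# Encoder input: task prefix + natural-language intent
model_inputs = tokenizer(task_prefix + example["intent"], max_length=512, truncation=True)

# Decoder target: the Python snippet, tokenized with the same tokenizer and used as labels
model_inputs["labels"] = tokenizer(example["snippet"], max_length=512, truncation=True)["input_ids"]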
