Upload 8 files
- README.md +300 -6
- config.json +38 -0
- flax_model.msgpack +3 -0
- gitattributes +9 -0
- merges.txt +0 -0
- tf_model.h5 +3 -0
- tokenizer_config.json +3 -0
- vocab.json +0 -0
README.md
CHANGED
@@ -1,8 +1,302 @@
---
language: code
thumbnail: https://cdn-media.huggingface.co/CodeBERTa/CodeBERTa.png
datasets:
- code_search_net
license: apache-2.0
base_model: huggingface/CodeBERTa-small-v1
---

# CodeBERTa-language-id: The World’s fanciest programming language identification algo 🤯

To demonstrate the usefulness of our CodeBERTa pretrained model on downstream tasks beyond language modeling, we fine-tune the [`CodeBERTa-small-v1`](https://huggingface.co/huggingface/CodeBERTa-small-v1) checkpoint on the task of classifying a sample of code into the programming language it's written in (*programming language identification*).

We add a sequence classification head on top of the model.
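Concretely, "adding a sequence classification head" just means instantiating the pretrained encoder through `RobertaForSequenceClassification` with one output per language. A minimal sketch (the six labels match the `config.json` in this commit; the full training setup is in the fine-tuning script further down):

```python
from transformers import RobertaForSequenceClassification

# Pretrained CodeBERTa encoder + a freshly initialized 6-way classification head
# (go, java, javascript, php, python, ruby).
model = RobertaForSequenceClassification.from_pretrained(
    "huggingface/CodeBERTa-small-v1",
    num_labels=6,
)
```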
On the evaluation dataset, we attain an eval accuracy and F1 > 0.999, which is not surprising given that the task of language identification is relatively easy (see an intuition why below).
## Quick start: using the raw model

```python
from transformers import RobertaForSequenceClassification, RobertaTokenizer

CODEBERTA_LANGUAGE_ID = "huggingface/CodeBERTa-language-id"

tokenizer = RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID)

input_ids = tokenizer.encode(CODE_TO_IDENTIFY, return_tensors="pt")
logits = model(input_ids)[0]

language_idx = logits.argmax()  # index for the resulting label
```
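To go from that index to a language name and a confidence score, you can read the `id2label` mapping stored in the model config (visible in `config.json` below). A minimal sketch that continues from the snippet above:

```python
import torch

# Continuing from the snippet above: map the argmax index back to a language
# name via the id2label mapping shipped in config.json, plus a softmax score.
probs = torch.softmax(logits, dim=-1)[0]
label = model.config.id2label[language_idx.item()]
print(label, probs[language_idx].item())
```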
## Quick start: using Pipelines 💪

```python
from transformers import TextClassificationPipeline

pipeline = TextClassificationPipeline(
    model=RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID),
    tokenizer=RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
)

pipeline(CODE_TO_IDENTIFY)
```

Let's start with something very easy:

```python
pipeline("""
def f(x):
    return x**2
""")
# [{'label': 'python', 'score': 0.9999965}]
```

Now let's probe shorter code samples:

```python
pipeline("const foo = 'bar'")
# [{'label': 'javascript', 'score': 0.9977546}]
```

What if I remove the `const` token from the assignment?

```python
pipeline("foo = 'bar'")
# [{'label': 'javascript', 'score': 0.7176245}]
```

For some reason, this is still statistically detected as JS code, even though it's also valid Python code. However, if we slightly tweak it:

```python
pipeline("foo = u'bar'")
# [{'label': 'python', 'score': 0.7638422}]
```

This is now detected as Python (notice the `u` string modifier).

Okay, enough with the JS and Python domination already! Let's try fancier languages:

```python
pipeline("echo $FOO")
# [{'label': 'php', 'score': 0.9995257}]
```

(Yes, I used the word "fancy" to describe PHP 😅)

```python
pipeline("outcome := rand.Intn(6) + 1")
# [{'label': 'go', 'score': 0.9936151}]
```

Why is the problem of language identification so easy (with the correct toolkit)? Because code's syntax is rigid, and simple tokens such as `:=` (the assignment operator in Go) are perfect predictors of the underlying language:

```python
pipeline(":=")
# [{'label': 'go', 'score': 0.9998052}]
```

By the way, because we trained our own custom tokenizer on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset, and it handles streams of bytes in a very generic way, syntactic constructs such as `:=` are represented by a single token:

```python
self.tokenizer.encode(" :=", add_special_tokens=False)
# [521]
```
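That snippet is lifted from the training script (hence the `self.`); for a self-contained way to reproduce the check, a sketch using the tokenizer published with the base checkpoint should show the same single-token behavior:

```python
from transformers import AutoTokenizer

# Byte-level BPE tokenizer published with the base CodeBERTa checkpoint.
tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")

ids = tokenizer.encode(" :=", add_special_tokens=False)
print(ids)  # a single id ([521] in the check above)
```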
<br>

## Fine-tuning code

<details>

```python
import gzip
import json
import logging
import os
from pathlib import Path
from typing import Dict, List, Tuple

import numpy as np
import torch
from sklearn.metrics import f1_score
from tokenizers.implementations.byte_level_bpe import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard.writer import SummaryWriter
from tqdm import tqdm, trange

from transformers import RobertaForSequenceClassification
from transformers.data.metrics import acc_and_f1, simple_accuracy


logging.basicConfig(level=logging.INFO)


CODEBERTA_PRETRAINED = "huggingface/CodeBERTa-small-v1"

LANGUAGES = [
    "go",
    "java",
    "javascript",
    "php",
    "python",
    "ruby",
]
FILES_PER_LANGUAGE = 1
EVALUATE = True

# Set up tokenizer
tokenizer = ByteLevelBPETokenizer("./pretrained/vocab.json", "./pretrained/merges.txt")
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")), ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

# Set up Tensorboard
tb_writer = SummaryWriter()


class CodeSearchNetDataset(Dataset):
    examples: List[Tuple[List[int], int]]

    def __init__(self, split: str = "train"):
        """
        train | valid | test
        """

        self.examples = []

        src_files = []
        for language in LANGUAGES:
            src_files += list(
                Path("../CodeSearchNet/resources/data/").glob(f"{language}/final/jsonl/{split}/*.jsonl.gz")
            )[:FILES_PER_LANGUAGE]
        for src_file in src_files:
            label = src_file.parents[3].name
            label_idx = LANGUAGES.index(label)
            print("🔥", src_file, label)
            lines = []
            fh = gzip.open(src_file, mode="rt", encoding="utf-8")
            for line in fh:
                o = json.loads(line)
                lines.append(o["code"])
            examples = [(x.ids, label_idx) for x in tokenizer.encode_batch(lines)]
            self.examples += examples
        print("🔥🔥")

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We’ll pad at the batch level.
        return self.examples[i]


model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_PRETRAINED, num_labels=len(LANGUAGES))

train_dataset = CodeSearchNetDataset(split="train")
eval_dataset = CodeSearchNetDataset(split="test")


def collate(examples):
    # padding_value=1 matches the pad_token_id in config.json
    input_ids = pad_sequence([torch.tensor(x[0]) for x in examples], batch_first=True, padding_value=1)
    labels = torch.tensor([x[1] for x in examples])
    # ^^ no .unsqueeze(-1) needed here
    return input_ids, labels


train_dataloader = DataLoader(train_dataset, batch_size=256, shuffle=True, collate_fn=collate)

batch = next(iter(train_dataloader))


model.to("cuda")
model.train()
for param in model.roberta.parameters():
    param.requires_grad = False
## ^^ Only train final layer.

print("num params:", model.num_parameters())
print("num trainable params:", model.num_parameters(only_trainable=True))


def evaluate():
    eval_loss = 0.0
    nb_eval_steps = 0
    preds = np.empty((0), dtype=np.int64)
    out_label_ids = np.empty((0), dtype=np.int64)

    model.eval()

    eval_dataloader = DataLoader(eval_dataset, batch_size=512, collate_fn=collate)
    for step, (input_ids, labels) in enumerate(tqdm(eval_dataloader, desc="Eval")):
        with torch.no_grad():
            outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
            loss = outputs[0]
            logits = outputs[1]
            eval_loss += loss.mean().item()
            nb_eval_steps += 1
            preds = np.append(preds, logits.argmax(dim=1).detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)
    eval_loss = eval_loss / nb_eval_steps
    acc = simple_accuracy(preds, out_label_ids)
    f1 = f1_score(y_true=out_label_ids, y_pred=preds, average="macro")
    print("=== Eval: loss ===", eval_loss)
    print("=== Eval: acc. ===", acc)
    print("=== Eval: f1 ===", f1)
    # print(acc_and_f1(preds, out_label_ids))
    tb_writer.add_scalars("eval", {"loss": eval_loss, "acc": acc, "f1": f1}, global_step)


### Training loop

global_step = 0
train_iterator = trange(0, 4, desc="Epoch")
optimizer = torch.optim.AdamW(model.parameters())
for _ in train_iterator:
    epoch_iterator = tqdm(train_dataloader, desc="Iteration")
    for step, (input_ids, labels) in enumerate(epoch_iterator):
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
        loss = outputs[0]
        loss.backward()
        tb_writer.add_scalar("training_loss", loss.item(), global_step)
        optimizer.step()
        global_step += 1
        if EVALUATE and global_step % 50 == 0:
            evaluate()
            model.train()


evaluate()

os.makedirs("./models/CodeBERT-language-id", exist_ok=True)
model.save_pretrained("./models/CodeBERT-language-id")
```

</details>
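The script saves only the model weights. A minimal sketch (assuming the `./models/CodeBERT-language-id` output directory from the script above) of reloading that local checkpoint for inference, reusing the base checkpoint's tokenizer since none is saved:

```python
from transformers import RobertaForSequenceClassification, RobertaTokenizer

LANGUAGES = ["go", "java", "javascript", "php", "python", "ruby"]

# Local directory written by model.save_pretrained() in the script above.
model = RobertaForSequenceClassification.from_pretrained("./models/CodeBERT-language-id")
# The script does not save a tokenizer, so reuse the base checkpoint's.
tokenizer = RobertaTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")

inputs = tokenizer("outcome := rand.Intn(6) + 1", return_tensors="pt")
logits = model(**inputs)[0]
print(LANGUAGES[logits.argmax(dim=-1).item()])  # should print "go" for a well-trained checkpoint
```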
<br>

## CodeSearchNet citation

<details>

```bibtex
@article{husain_codesearchnet_2019,
	title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
	shorttitle = {{CodeSearchNet} {Challenge}},
	url = {http://arxiv.org/abs/1909.09436},
	urldate = {2020-03-12},
	journal = {arXiv:1909.09436 [cs, stat]},
	author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
	month = sep,
	year = {2019},
	note = {arXiv: 1909.09436},
}
```

</details>
config.json
ADDED
@@ -0,0 +1,38 @@
{
  "_num_labels": 6,
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "go",
    "1": "java",
    "2": "javascript",
    "3": "php",
    "4": "python",
    "5": "ruby"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "go": 0,
    "java": 1,
    "javascript": 2,
    "php": 3,
    "python": 4,
    "ruby": 5
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 52000
}
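The `id2label`/`label2id` entries above are what let the pipeline return language names rather than bare indices. A minimal sketch of inspecting them programmatically with the standard `transformers` config API:

```python
from transformers import AutoConfig

# Download only the config shown above and inspect its label mapping.
config = AutoConfig.from_pretrained("huggingface/CodeBERTa-language-id")
print(config.num_labels)  # 6
print(config.id2label)    # {0: 'go', 1: 'java', 2: 'javascript', 3: 'php', 4: 'python', 5: 'ruby'}
```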
flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8407f0ac9f759ced96ae40c366db6a03b3a5a725dd915252c9456b3945537eb2
size 333825787
gitattributes
ADDED
@@ -0,0 +1,9 @@
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
merges.txt
ADDED
The diff for this file is too large to render.
tf_model.h5
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9114ffaf2db4344a1e463f0713177ab811b8c141746090144a9e3e8b52155890
size 333971544
tokenizer_config.json
ADDED
@@ -0,0 +1,3 @@
{
  "max_len": 512
}
vocab.json
ADDED
The diff for this file is too large to render.