dnnsdunca committed on
Commit
c9be19e
1 Parent(s): b00bdac

Upload 8 files

Files changed (8)
  1. README.md +300 -6
  2. config.json +38 -0
  3. flax_model.msgpack +3 -0
  4. gitattributes +9 -0
  5. merges.txt +0 -0
  6. tf_model.h5 +3 -0
  7. tokenizer_config.json +3 -0
  8. vocab.json +0 -0
README.md CHANGED
@@ -1,8 +1,302 @@
  ---
+ language: code
+ thumbnail: https://cdn-media.huggingface.co/CodeBERTa/CodeBERTa.png
+ datasets:
+ - code_search_net
  license: apache-2.0
- language:
- - en
- library_name: adapter-transformers
- tags:
- - code
- ---
+ base_model: huggingface/CodeBERTa-small-v1
+ ---
+
+ # CodeBERTa-language-id: The World’s fanciest programming language identification algo 🤯
+
+
+ To demonstrate the usefulness of our CodeBERTa pretrained model on downstream tasks beyond language modeling, we fine-tune the [`CodeBERTa-small-v1`](https://huggingface.co/huggingface/CodeBERTa-small-v1) checkpoint on the task of classifying a sample of code into the programming language it's written in (*programming language identification*).
+
+ We add a sequence classification head on top of the model.
+
+ On the evaluation dataset, we attain an eval accuracy and F1 > 0.999, which is not surprising given that language identification is a relatively easy task (see below for an intuition as to why).
+
+ ## Quick start: using the raw model
+
+ ```python
+ from transformers import RobertaForSequenceClassification, RobertaTokenizer
+
+ CODEBERTA_LANGUAGE_ID = "huggingface/CodeBERTa-language-id"
+
+ tokenizer = RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
+ model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID)
+
+ input_ids = tokenizer.encode(CODE_TO_IDENTIFY, return_tensors="pt")
+ logits = model(input_ids)[0]
+
+ language_idx = logits.argmax()  # index of the predicted label
+ ```
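To turn that index into a language name, you can look it up in the `id2label` mapping shipped in this repository's `config.json` (a minimal sketch, continuing the snippet above):

```python
# Map the predicted index back to a human-readable label using the
# id2label mapping that from_pretrained loads from config.json.
predicted_language = model.config.id2label[language_idx.item()]
print(predicted_language)  # e.g. "python"
```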
+
+
+ ## Quick start: using Pipelines 💪
+
+ ```python
+ from transformers import RobertaForSequenceClassification, RobertaTokenizer, TextClassificationPipeline
+
+ pipeline = TextClassificationPipeline(
+     model=RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID),
+     tokenizer=RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
+ )
+
+ pipeline(CODE_TO_IDENTIFY)
+ ```
+
+ Let's start with something very easy:
+
+ ```python
+ pipeline("""
+ def f(x):
+     return x**2
+ """)
+ # [{'label': 'python', 'score': 0.9999965}]
+ ```
+
+ Now let's probe shorter code samples:
+
+ ```python
+ pipeline("const foo = 'bar'")
+ # [{'label': 'javascript', 'score': 0.9977546}]
+ ```
+
+ What if I remove the `const` token from the assignment?
+
+ ```python
+ pipeline("foo = 'bar'")
+ # [{'label': 'javascript', 'score': 0.7176245}]
+ ```
+
+ For some reason, this is still statistically detected as JS code, even though it's also valid Python code. However, if we tweak it slightly:
+
+ ```python
+ pipeline("foo = u'bar'")
+ # [{'label': 'python', 'score': 0.7638422}]
+ ```
+
+ This is now detected as Python (notice the `u` string prefix).
+
+ Okay, enough with the JS and Python domination already! Let's try fancier languages:
+
+ ```python
+ pipeline("echo $FOO")
+ # [{'label': 'php', 'score': 0.9995257}]
+ ```
+
+ (Yes, I used the word "fancy" to describe PHP 😅)
+
+ ```python
+ pipeline("outcome := rand.Intn(6) + 1")
+ # [{'label': 'go', 'score': 0.9936151}]
+ ```
+
+ Why is the problem of language identification so easy (with the correct toolkit)? Because code's syntax is rigid, and simple tokens such as `:=` (the assignment operator in Go) are perfect predictors of the underlying language:
+
+ ```python
+ pipeline(":=")
+ # [{'label': 'go', 'score': 0.9998052}]
+ ```
+
+ By the way, because we trained our own custom tokenizer on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset, and it handles streams of bytes in a very generic way, syntactic constructs such as `:=` are represented by a single token:
+
+ ```python
+ tokenizer.encode(" :=", add_special_tokens=False)
+ # [521]
+ ```
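As a sanity check, the id round-trips back to the original characters. A minimal sketch using the tokenizer published on the Hub (assuming it ships the same byte-level BPE vocabulary as the custom tokenizer above; the exact id may differ, but ` :=` should come back as a single entry):

```python
from transformers import RobertaTokenizer

codeberta_tokenizer = RobertaTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")

ids = codeberta_tokenizer.encode(" :=", add_special_tokens=False)
print(ids)                              # a single token id, e.g. [521]
print(codeberta_tokenizer.decode(ids))  # " :="
```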
+
+ <br>
+
+ ## Fine-tuning code
+
+ <details>
+
+ ```python
+ import gzip
+ import json
+ import logging
+ import os
+ from pathlib import Path
+ from typing import List, Tuple
+
+ import numpy as np
+ import torch
+ from sklearn.metrics import f1_score
+ from tokenizers.implementations.byte_level_bpe import ByteLevelBPETokenizer
+ from tokenizers.processors import BertProcessing
+ from torch.nn.utils.rnn import pad_sequence
+ from torch.utils.data import DataLoader, Dataset
+ from torch.utils.tensorboard.writer import SummaryWriter
+ from tqdm import tqdm, trange
+
+ from transformers import RobertaForSequenceClassification
+ from transformers.data.metrics import acc_and_f1, simple_accuracy
+
+
+ logging.basicConfig(level=logging.INFO)
+
+
+ CODEBERTA_PRETRAINED = "huggingface/CodeBERTa-small-v1"
+
+ LANGUAGES = [
+     "go",
+     "java",
+     "javascript",
+     "php",
+     "python",
+     "ruby",
+ ]
+ FILES_PER_LANGUAGE = 1
+ EVALUATE = True
+
+ # Set up tokenizer
+ tokenizer = ByteLevelBPETokenizer("./pretrained/vocab.json", "./pretrained/merges.txt")
+ tokenizer._tokenizer.post_processor = BertProcessing(
+     ("</s>", tokenizer.token_to_id("</s>")), ("<s>", tokenizer.token_to_id("<s>")),
+ )
+ tokenizer.enable_truncation(max_length=512)
+
+ # Set up TensorBoard
+ tb_writer = SummaryWriter()
+
+
+ class CodeSearchNetDataset(Dataset):
+     examples: List[Tuple[List[int], int]]
+
+     def __init__(self, split: str = "train"):
+         """
+         split: train | valid | test
+         """
+         self.examples = []
+
+         src_files = []
+         for language in LANGUAGES:
+             src_files += list(
+                 Path("../CodeSearchNet/resources/data/").glob(f"{language}/final/jsonl/{split}/*.jsonl.gz")
+             )[:FILES_PER_LANGUAGE]
+         for src_file in src_files:
+             label = src_file.parents[3].name
+             label_idx = LANGUAGES.index(label)
+             print("🔥", src_file, label)
+             lines = []
+             fh = gzip.open(src_file, mode="rt", encoding="utf-8")
+             for line in fh:
+                 o = json.loads(line)
+                 lines.append(o["code"])
+             examples = [(x.ids, label_idx) for x in tokenizer.encode_batch(lines)]
+             self.examples += examples
+         print("🔥🔥")
+
+     def __len__(self):
+         return len(self.examples)
+
+     def __getitem__(self, i):
+         # We’ll pad at the batch level.
+         return self.examples[i]
+
+
+ model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_PRETRAINED, num_labels=len(LANGUAGES))
+
+ train_dataset = CodeSearchNetDataset(split="train")
+ eval_dataset = CodeSearchNetDataset(split="test")
+
+
+ def collate(examples):
+     # padding_value=1 matches the model's pad_token_id.
+     input_ids = pad_sequence([torch.tensor(x[0]) for x in examples], batch_first=True, padding_value=1)
+     labels = torch.tensor([x[1] for x in examples])
+     # (no .unsqueeze(-1) needed: labels is a 1-D tensor of class indices)
+     return input_ids, labels
+
+
+ train_dataloader = DataLoader(train_dataset, batch_size=256, shuffle=True, collate_fn=collate)
+
+ batch = next(iter(train_dataloader))
+
+
+ model.to("cuda")
+ model.train()
+ for param in model.roberta.parameters():
+     param.requires_grad = False
+ ## ^^ Freeze the backbone and only train the classification head.
+
+ print("num params:", model.num_parameters())
+ print("num trainable params:", model.num_parameters(only_trainable=True))
+
+
+ def evaluate():
+     eval_loss = 0.0
+     nb_eval_steps = 0
+     preds = np.empty((0), dtype=np.int64)
+     out_label_ids = np.empty((0), dtype=np.int64)
+
+     model.eval()
+
+     eval_dataloader = DataLoader(eval_dataset, batch_size=512, collate_fn=collate)
+     for step, (input_ids, labels) in enumerate(tqdm(eval_dataloader, desc="Eval")):
+         with torch.no_grad():
+             outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
+             loss = outputs[0]
+             logits = outputs[1]
+             eval_loss += loss.mean().item()
+             nb_eval_steps += 1
+         preds = np.append(preds, logits.argmax(dim=1).detach().cpu().numpy(), axis=0)
+         out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)
+     eval_loss = eval_loss / nb_eval_steps
+     acc = simple_accuracy(preds, out_label_ids)
+     f1 = f1_score(y_true=out_label_ids, y_pred=preds, average="macro")
+     print("=== Eval: loss ===", eval_loss)
+     print("=== Eval: acc. ===", acc)
+     print("=== Eval: f1 ===", f1)
+     # print(acc_and_f1(preds, out_label_ids))
+     tb_writer.add_scalars("eval", {"loss": eval_loss, "acc": acc, "f1": f1}, global_step)
+
+
+ ### Training loop
+
+ global_step = 0
+ train_iterator = trange(0, 4, desc="Epoch")
+ optimizer = torch.optim.AdamW(model.parameters())
+ for _ in train_iterator:
+     epoch_iterator = tqdm(train_dataloader, desc="Iteration")
+     for step, (input_ids, labels) in enumerate(epoch_iterator):
+         optimizer.zero_grad()
+         outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
+         loss = outputs[0]
+         loss.backward()
+         tb_writer.add_scalar("training_loss", loss.item(), global_step)
+         optimizer.step()
+         global_step += 1
+         if EVALUATE and global_step % 50 == 0:
+             evaluate()
+             model.train()
+
+
+ evaluate()
+
+ os.makedirs("./models/CodeBERT-language-id", exist_ok=True)
+ model.save_pretrained("./models/CodeBERT-language-id")
+ ```
+
+ </details>
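Once training finishes, the saved checkpoint can be reloaded for inference just like the published model. A minimal sketch, assuming the local `./models/CodeBERT-language-id` directory produced by the script above (note the script does not write `id2label` into the saved config, so the `LANGUAGES` list is reused to name the prediction):

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

LANGUAGES = ["go", "java", "javascript", "php", "python", "ruby"]

# Reload the fine-tuned weights saved by model.save_pretrained(...) above.
model = RobertaForSequenceClassification.from_pretrained("./models/CodeBERT-language-id")
model.eval()

# The script does not save a tokenizer, so reuse the pretrained CodeBERTa
# tokenizer (the same vocabulary the model was fine-tuned with).
tokenizer = RobertaTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")

inputs = tokenizer("def f(x):\n    return x**2", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(LANGUAGES[logits.argmax(dim=-1).item()])  # expected: "python"
```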
+
+ <br>
+
+ ## CodeSearchNet citation
+
+ <details>
+
+ ```bibtex
+ @article{husain_codesearchnet_2019,
+     title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
+     shorttitle = {{CodeSearchNet} {Challenge}},
+     url = {http://arxiv.org/abs/1909.09436},
+     urldate = {2020-03-12},
+     journal = {arXiv:1909.09436 [cs, stat]},
+     author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
+     month = sep,
+     year = {2019},
+     note = {arXiv: 1909.09436},
+ }
+ ```
+
+ </details>
config.json ADDED
@@ -0,0 +1,38 @@
+ {
+   "_num_labels": 6,
+   "architectures": [
+     "RobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "go",
+     "1": "java",
+     "2": "javascript",
+     "3": "php",
+     "4": "python",
+     "5": "ruby"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "go": 0,
+     "java": 1,
+     "javascript": 2,
+     "php": 3,
+     "python": 4,
+     "ruby": 5
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 1,
+   "type_vocab_size": 1,
+   "vocab_size": 52000
+ }
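This config is what gives the checkpoint its 6-way classification head and the human-readable labels returned by the pipeline. A quick way to inspect it programmatically (a minimal sketch using `AutoConfig`):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("huggingface/CodeBERTa-language-id")

print(config.num_labels)          # 6
print(config.id2label[3])         # "php"
print(config.label2id["python"])  # 4
```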
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8407f0ac9f759ced96ae40c366db6a03b3a5a725dd915252c9456b3945537eb2
+ size 333825787
gitattributes ADDED
@@ -0,0 +1,9 @@
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9114ffaf2db4344a1e463f0713177ab811b8c141746090144a9e3e8b52155890
+ size 333971544
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "max_len": 512
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff