---
language: code
thumbnail: https://cdn-media.huggingface.co/CodeBERTa/CodeBERTa.png
datasets:
- code_search_net
license: apache-2.0
base_model: huggingface/CodeBERTa-small-v1
---

# CodeBERTa-language-id: The World’s fanciest programming language identification algo 🤯

To demonstrate the usefulness of our CodeBERTa pretrained model on downstream tasks beyond language modeling, we fine-tune the [`CodeBERTa-small-v1`](https://huggingface.co/huggingface/CodeBERTa-small-v1) checkpoint on the task of classifying a sample of code into the programming language it's written in (*programming language identification*).

We add a sequence classification head on top of the model.

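Concretely, that just means loading the pretrained checkpoint into a sequence classification architecture with one output per CodeSearchNet language. A minimal sketch (the full fine-tuning script is further below):

```python
from transformers import RobertaForSequenceClassification

LANGUAGES = ["go", "java", "javascript", "php", "python", "ruby"]

# Loads the pretrained CodeBERTa encoder and adds a freshly initialized
# classification head with one logit per language.
model = RobertaForSequenceClassification.from_pretrained(
    "huggingface/CodeBERTa-small-v1", num_labels=len(LANGUAGES)
)
```
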
On the evaluation dataset, we attain an eval accuracy and F1 score > 0.999, which is not surprising given that the task of language identification is relatively easy (see below for an intuition as to why).

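The reported numbers are plain accuracy and macro-averaged F1 over the six languages; a minimal sketch of how they are computed, mirroring the `evaluate()` function in the fine-tuning code below (the prediction arrays here are toy values for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

preds = np.array([0, 1, 2, 2, 4, 5])   # toy predicted language indices
labels = np.array([0, 1, 2, 3, 4, 5])  # toy true language indices

acc = (preds == labels).mean()
f1 = f1_score(y_true=labels, y_pred=preds, average="macro")
print(acc, f1)
```
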
## Quick start: using the raw model

```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification

CODEBERTA_LANGUAGE_ID = "huggingface/CodeBERTa-language-id"

tokenizer = RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID)

# CODE_TO_IDENTIFY is any string of source code you want to classify.
input_ids = tokenizer.encode(CODE_TO_IDENTIFY, return_tensors="pt")
logits = model(input_ids)[0]

language_idx = logits.argmax()  # index of the predicted label
```
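
To map the predicted index back to a language name, you can use the `id2label` mapping stored in the model config (the hosted config carries the label names, which is also what the pipeline outputs below rely on):

```python
predicted_language = model.config.id2label[language_idx.item()]
print(predicted_language)  # e.g. 'python'
```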

## Quick start: using Pipelines 💪

```python
from transformers import TextClassificationPipeline

pipeline = TextClassificationPipeline(
    model=RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID),
    tokenizer=RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
)

pipeline(CODE_TO_IDENTIFY)
```

Let's start with something very easy:

```python
pipeline("""
def f(x):
    return x**2
""")
# [{'label': 'python', 'score': 0.9999965}]
```

Now let's probe shorter code samples:

```python
pipeline("const foo = 'bar'")
# [{'label': 'javascript', 'score': 0.9977546}]
```

What if I remove the `const` token from the assignment?
```python
pipeline("foo = 'bar'")
# [{'label': 'javascript', 'score': 0.7176245}]
```

For some reason, this is still statistically detected as JS code, even though it's also valid Python code. However, if we slightly tweak it:

```python
pipeline("foo = u'bar'")
# [{'label': 'python', 'score': 0.7638422}]
```
This is now detected as Python (notice the `u` string prefix).

Okay, enough with the JS and Python domination already! Let's try fancier languages:

```python
pipeline("echo $FOO")
# [{'label': 'php', 'score': 0.9995257}]
```

(Yes, I used the word "fancy" to describe PHP 😅)

```python
pipeline("outcome := rand.Intn(6) + 1")
# [{'label': 'go', 'score': 0.9936151}]
```

Why is the problem of language identification so easy (with the correct toolkit)? Because code's syntax is rigid, and simple tokens such as `:=` (the short variable declaration operator in Go) are perfect predictors of the underlying language:

```python
pipeline(":=")
# [{'label': 'go', 'score': 0.9998052}]
```

By the way, because we trained our own custom tokenizer on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset, and it handles streams of bytes in a very generic way, syntactic constructs such as `:=` are represented by a single token:

```python
tokenizer.encode(" :=", add_special_tokens=False)
# [521]
```

<br>

## Fine-tuning code

<details>

```python
import gzip
import json
import logging
import os
from pathlib import Path
from typing import List, Tuple

import numpy as np
import torch
from sklearn.metrics import f1_score
from tokenizers.implementations.byte_level_bpe import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard.writer import SummaryWriter
from tqdm import tqdm, trange

from transformers import RobertaForSequenceClassification
from transformers.data.metrics import acc_and_f1, simple_accuracy


logging.basicConfig(level=logging.INFO)


CODEBERTA_PRETRAINED = "huggingface/CodeBERTa-small-v1"

LANGUAGES = [
    "go",
    "java",
    "javascript",
    "php",
    "python",
    "ruby",
]
FILES_PER_LANGUAGE = 1
EVALUATE = True

# Set up tokenizer
tokenizer = ByteLevelBPETokenizer("./pretrained/vocab.json", "./pretrained/merges.txt")
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")), ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

# Set up Tensorboard
tb_writer = SummaryWriter()


class CodeSearchNetDataset(Dataset):
    examples: List[Tuple[List[int], int]]

    def __init__(self, split: str = "train"):
        """
        train | valid | test
        """

        self.examples = []

        src_files = []
        for language in LANGUAGES:
            src_files += list(
                Path("../CodeSearchNet/resources/data/").glob(f"{language}/final/jsonl/{split}/*.jsonl.gz")
            )[:FILES_PER_LANGUAGE]
        for src_file in src_files:
            label = src_file.parents[3].name
            label_idx = LANGUAGES.index(label)
            print("🔥", src_file, label)
            lines = []
            fh = gzip.open(src_file, mode="rt", encoding="utf-8")
            for line in fh:
                o = json.loads(line)
                lines.append(o["code"])
            examples = [(x.ids, label_idx) for x in tokenizer.encode_batch(lines)]
            self.examples += examples
        print("🔥🔥")

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We’ll pad at the batch level.
        return self.examples[i]


model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_PRETRAINED, num_labels=len(LANGUAGES))

train_dataset = CodeSearchNetDataset(split="train")
eval_dataset = CodeSearchNetDataset(split="test")


def collate(examples):
    # Pad each batch to the length of its longest sequence (RoBERTa's padding id is 1).
    input_ids = pad_sequence([torch.tensor(x[0]) for x in examples], batch_first=True, padding_value=1)
    labels = torch.tensor([x[1] for x in examples])
    # labels stay 1-D: no .unsqueeze(-1) is needed for sequence classification.
    return input_ids, labels


train_dataloader = DataLoader(train_dataset, batch_size=256, shuffle=True, collate_fn=collate)

batch = next(iter(train_dataloader))


model.to("cuda")
model.train()
for param in model.roberta.parameters():
    param.requires_grad = False
## ^^ Freeze the encoder: only the classification head is trained.

print("num params:", model.num_parameters())
print("num trainable params:", model.num_parameters(only_trainable=True))


def evaluate():
    eval_loss = 0.0
    nb_eval_steps = 0
    preds = np.empty((0), dtype=np.int64)
    out_label_ids = np.empty((0), dtype=np.int64)

    model.eval()

    eval_dataloader = DataLoader(eval_dataset, batch_size=512, collate_fn=collate)
    for step, (input_ids, labels) in enumerate(tqdm(eval_dataloader, desc="Eval")):
        with torch.no_grad():
            outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
            loss = outputs[0]
            logits = outputs[1]
            eval_loss += loss.mean().item()
            nb_eval_steps += 1
            preds = np.append(preds, logits.argmax(dim=1).detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)
    eval_loss = eval_loss / nb_eval_steps
    acc = simple_accuracy(preds, out_label_ids)
    f1 = f1_score(y_true=out_label_ids, y_pred=preds, average="macro")
    print("=== Eval: loss ===", eval_loss)
    print("=== Eval: acc. ===", acc)
    print("=== Eval: f1 ===", f1)
    # print(acc_and_f1(preds, out_label_ids))
    tb_writer.add_scalars("eval", {"loss": eval_loss, "acc": acc, "f1": f1}, global_step)


### Training loop

global_step = 0
train_iterator = trange(0, 4, desc="Epoch")
optimizer = torch.optim.AdamW(model.parameters())
for _ in train_iterator:
    epoch_iterator = tqdm(train_dataloader, desc="Iteration")
    for step, (input_ids, labels) in enumerate(epoch_iterator):
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
        loss = outputs[0]
        loss.backward()
        tb_writer.add_scalar("training_loss", loss.item(), global_step)
        optimizer.step()
        global_step += 1
        if EVALUATE and global_step % 50 == 0:
            evaluate()
            model.train()


evaluate()

os.makedirs("./models/CodeBERT-language-id", exist_ok=True)
model.save_pretrained("./models/CodeBERT-language-id")
```

</details>
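
Note that the script above does not write human-readable label names into the saved config, whereas the hosted model's config evidently has them (the pipeline returns `'python'`, `'go'`, etc. rather than `LABEL_0`). A hedged sketch of how that mapping can be attached before saving:

```python
# Store the language names in the config so that downstream pipelines
# report e.g. 'python' instead of 'LABEL_0'.
model.config.id2label = {i: lang for i, lang in enumerate(LANGUAGES)}
model.config.label2id = {lang: i for i, lang in enumerate(LANGUAGES)}
model.save_pretrained("./models/CodeBERT-language-id")
```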

<br>

## CodeSearchNet citation

<details>

```bibtex
@article{husain_codesearchnet_2019,
    title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
    shorttitle = {{CodeSearchNet} {Challenge}},
    url = {http://arxiv.org/abs/1909.09436},
    urldate = {2020-03-12},
    journal = {arXiv:1909.09436 [cs, stat]},
    author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
    month = sep,
    year = {2019},
    note = {arXiv: 1909.09436},
}
```

</details>