Duplicate from michiyasunaga/BioLinkBERT-large

Co-authored-by: Michihiro Yasunaga <[email protected]>

Files changed (8)

.gitattributes +27 -0
README.md +87 -0
config.json +23 -0
pytorch_model.bin +3 -0
special_tokens_map.json +1 -0
tokenizer.json +0 -0
tokenizer_config.json +1 -0
vocab.txt +0 -0

.gitattributes ADDED

	@@ -0,0 +1,27 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zstandard filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
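
To see which files in this commit the patterns above would route through Git LFS, the rules can be approximated with shell-style globbing. This is only a rough sketch: `fnmatchcase` is not identical to gitattributes pattern matching, the pattern list is a subset of the file above, and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase

# A subset of the patterns added to .gitattributes in this commit.
LFS_PATTERNS = ["*.bin", "*.bin.*", "*.h5", "*.onnx", "*.pt", "*.pth", "*.zip", "*tfevents*"]


def is_lfs_tracked(filename: str, patterns=LFS_PATTERNS) -> bool:
    """Return True if the bare file name matches any of the LFS patterns.

    Approximation only: fnmatch globbing differs from gitattributes matching
    in some cases (for example, in how directory separators are handled).
    """
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


# File names taken from the "Files changed" list of this commit.
for name in ["pytorch_model.bin", "config.json", "vocab.txt", "tokenizer.json"]:
    print(name, is_lfs_tracked(name))
```

Of the files in this commit, only `pytorch_model.bin` matches one of these patterns, which is consistent with its diff being an LFS pointer rather than file content.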

README.md ADDED

	@@ -0,0 +1,87 @@

+---
+license: apache-2.0
+language: en
+datasets:
+- pubmed
+tags:
+- bert
+- exbert
+- linkbert
+- biolinkbert
+- feature-extraction
+- fill-mask
+- question-answering
+- text-classification
+- token-classification
+widget:
+- text: Sunitinib is a tyrosine kinase inhibitor
+duplicated_from: michiyasunaga/BioLinkBERT-large
+---
+## BioLinkBERT-large
+BioLinkBERT-large model pretrained on [PubMed](https://pubmed.ncbi.nlm.nih.gov/) abstracts along with citation link information. It is introduced in the paper [LinkBERT: Pretraining Language Models with Document Links (ACL 2022)](https://arxiv.org/abs/2203.15827). The code and data are available in [this repository](https://github.com/michiyasunaga/LinkBERT).
+This model achieves state-of-the-art performance on several biomedical NLP benchmarks such as [BLURB](https://microsoft.github.io/BLURB/) and [MedQA-USMLE](https://github.com/jind11/MedQA).
+## Model description
+LinkBERT is a transformer encoder (BERT-like) model pretrained on a large corpus of documents. It improves on BERT by capturing **document links**, such as hyperlinks and citation links, to incorporate knowledge that spans multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, in addition to single documents.
+LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance for general language understanding tasks (e.g. text classification), and is also particularly effective for **knowledge-intensive** tasks (e.g. question answering) and **cross-document** tasks (e.g. reading comprehension, document retrieval).
+## Intended uses & limitations
+The model can be used by fine-tuning on a downstream task, such as question answering, sequence classification, and token classification.
+You can also use the raw model for feature extraction (i.e. obtaining embeddings for input text).
+### How to use
+To use the model to get the features of a given text in PyTorch:
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/BioLinkBERT-large')
+model = AutoModel.from_pretrained('michiyasunaga/BioLinkBERT-large')
+inputs = tokenizer("Sunitinib is a tyrosine kinase inhibitor", return_tensors="pt")
+outputs = model(**inputs)
+last_hidden_states = outputs.last_hidden_state
+```
+For fine-tuning, you can use [this repository](https://github.com/michiyasunaga/LinkBERT) or any other BERT fine-tuning codebase.
+## Evaluation results
+When fine-tuned on downstream tasks, LinkBERT achieves the following results.
+**Biomedical benchmarks ([BLURB](https://microsoft.github.io/BLURB/), [MedQA](https://github.com/jind11/MedQA), [MMLU](https://github.com/hendrycks/test), etc.):** BioLinkBERT attains new state-of-the-art.
+|                         | BLURB score | PubMedQA | BioASQ   | MedQA-USMLE |
+| ----------------------  | --------    | -------- | -------  | --------    |
+| PubmedBERT-base         | 81.10       | 55.8     | 87.5     | 38.1        |
+| **BioLinkBERT-base**    | **83.39**   | **70.2** | **91.4** | **40.0** |
+| **BioLinkBERT-large**   | **84.30**   | **72.2** | **94.8** | **44.6** |
+|                         | MMLU-professional medicine     |
+| ----------------------  | --------  |
+| GPT-3 (175B params)     | 38.7      |
+| UnifiedQA (11B params)  | 43.2      |
+| **BioLinkBERT-large (340M params)** | **50.7**  |
+## Citation
+If you find LinkBERT useful in your project, please cite the following:
+```bibtex
+@InProceedings{yasunaga2022linkbert,
+  author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
+  title =   {LinkBERT: Pretraining Language Models with Document Links},
+  year =    {2022},
+  booktitle = {Association for Computational Linguistics (ACL)},
+}
+```

config.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.9.0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 28895
+}
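
As a sanity check on the values above, the parameter count they imply can be estimated by hand. The sketch below assumes the standard `BertModel` layout (word, position, and token-type embeddings with a LayerNorm; per layer, four attention projections and a two-layer feed-forward block, each followed by a LayerNorm; and a pooler). The function name `bert_param_count` is hypothetical, and the layout is an assumption rather than something stated in this commit.

```python
# Values copied from config.json in this commit.
config = {
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "max_position_embeddings": 512,
    "num_hidden_layers": 24,
    "type_vocab_size": 2,
    "vocab_size": 28895,
}


def bert_param_count(cfg: dict, with_pooler: bool = True) -> int:
    """Estimate trainable parameters for a standard BERT encoder (assumed layout)."""
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    layer_norm = 2 * h  # weight + bias

    # Word, position, and token-type embedding tables, plus one LayerNorm.
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm

    # Query, key, value, and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (h * h + h) + layer_norm

    # Up-projection and down-projection (weights + biases), plus LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm

    # Dense layer applied to the [CLS] position.
    pooler = (h * h + h) if with_pooler else 0

    return embeddings + cfg["num_hidden_layers"] * (attention + feed_forward) + pooler


total = bert_param_count(config)
print(f"{total:,} parameters, about {total * 4:,} bytes as float32")
```

Under these assumptions the estimate is roughly 333M parameters, in line with the "340M params" figure quoted in the README. If the weights are stored as float32, that also lands close to the size recorded for `pytorch_model.bin`.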

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fed75e5716547b54198d4dd123e7a3f3c64a82e1172b3492a11deebd6ab4cd4d
+size 1334073393
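
The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 digest, and the byte size of the real file. A minimal sketch of how a downloaded copy could be checked against such a pointer follows; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local path in the usage note is a placeholder.

```python
import hashlib
from pathlib import Path

# Pointer contents copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:fed75e5716547b54198d4dd123e7a3f3c64a82e1172b3492a11deebd6ab4cd4d
size 1334073393
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a pointer file into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the size and digest the pointer records."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first, avoids hashing a truncated download
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in chunks so a file of over 1 GB is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

With a local copy of the weights, `matches_pointer("pytorch_model.bin", pointer)` would report whether the download is complete and unmodified.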

special_tokens_map.json ADDED

	@@ -0,0 +1 @@


+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1 @@


+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "BertTokenizer"}

vocab.txt ADDED

The diff for this file is too large to render. See raw diff