AIRI-Institute/gena-lm-bert-base-lastln-t2t

@@ -4,13 +4,13 @@ tags:
 - human_genome
 ---
-# GENA-LM (gena-lm-bert-base-lastln-t2t)
 GENA-LM is a Family of Open-Source Foundational Models for Long DNA Sequences.
 GENA-LM models are transformer masked language models trained on human DNA sequence.
-Differences between GENA-LM (`gena-lm-bert-base-lastln-t2t`) and DNABERT:
 - BPE tokenization instead of k-mers;
 - input sequence size is about 4500 nucleotides (512 BPE tokens) compared to 512 nucleotides of DNABERT
 - pre-training on T2T vs. GRCh38.p13 human genome assembly.
@@ -21,32 +21,56 @@ Paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1
 ## Examples
-### Load pre-trained model
 ```python
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-lastln-t2t')
-model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-lastln-t2t')
 ```
-### How to load the model to fine-tune it on classification task
 ```python
-from src.gena_lm.modeling_bert import BertForSequenceClassification
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-lastln-t2t')
 model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-lastln-t2t')
 ```
 ## Model description
-GENA-LM (`gena-lm-bert-base-lastln-t2t`) model is trained in a masked language model (MLM) fashion, following the methods proposed in the BigBird paper by masking 15% of tokens. Model config for `gena-lm-bert-base-lastln-t2t` is similar to the bert-base:
 - 512 Maximum sequence length
 - 12 Layers, 12 Attention heads
 - 768 Hidden size
 - 32k Vocabulary size
-We pre-trained `gena-lm-bert-base-lastln-t2t` using the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling mutations from 1000-genome SNPs (gnomAD dataset). Pre-training was performed for 2,100,000 iterations with batch size 256 and sequence length was equal to 512 tokens. We modified Transformer with [Pre-Layer normalization](https://arxiv.org/abs/2002.04745).
 ## Evaluation
 For evaluation results, see our paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1

 - human_genome
 ---
+# GENA-LM (gena-lm-bert-base-lastln-t2t)
 GENA-LM is a Family of Open-Source Foundational Models for Long DNA Sequences.
 GENA-LM models are transformer masked language models trained on human DNA sequence.
+Differences between GENA-LM (`gena-lm-bert-base-lastln-t2t`) and DNABERT:
 - BPE tokenization instead of k-mers;
 - input sequence size is about 4500 nucleotides (512 BPE tokens) compared to 512 nucleotides of DNABERT
 - pre-training on T2T vs. GRCh38.p13 human genome assembly.
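To make the context-length comparison above concrete, here is a minimal sketch of overlapping k-mer tokenization, the scheme DNABERT uses. The function name `kmer_tokenize` and the toy sequence are illustrative assumptions, not code from DNABERT or GENA-LM. Because consecutive k-mers are shifted by one nucleotide, a 512-token window covers only about 512 nucleotides, while 512 BPE tokens covering about 4500 nucleotides implies roughly 9 nucleotides per token on average.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


toy_sequence = "ATGCGTACGTTAGC"  # hypothetical 14-nucleotide example
tokens = kmer_tokenize(toy_sequence, k=6)
print(tokens[:3])   # ['ATGCGT', 'TGCGTA', 'GCGTAC']
print(len(tokens))  # 9 tokens for 14 nucleotides (14 - 6 + 1)

# Average nucleotides per BPE token implied by the figures above.
nucleotides_per_bpe_token = 4500 / 512
print(round(nucleotides_per_bpe_token, 1))  # 8.8
```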
 ## Examples
+### How to load the pre-trained model for Masked Language Modeling
 ```python
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-lastln-t2t')
+model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-lastln-t2t', trust_remote_code=True)
+```
+### How to load the pre-trained model to fine-tune it on a classification task
+Get the model class from the GENA-LM repository:
+```bash
+git clone https://github.com/AIRI-Institute/GENA_LM.git
 ```
 ```python
+from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-lastln-t2t')
 model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-lastln-t2t')
 ```
+Alternatively, download [modeling_bert.py](https://github.com/AIRI-Institute/GENA_LM/tree/main/src/gena_lm) and place it next to your code.
+Or you can get the model class through HuggingFace AutoModel:
+```python
+import importlib
+
+from transformers import AutoModel
+
+# trust_remote_code=True lets transformers load the GENA-LM modeling code shipped with the model repo
+model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-lastln-t2t', trust_remote_code=True)
+# name of the dynamically loaded module that defines the model classes
+gena_module_name = model.__class__.__module__
+print(gena_module_name)
+
+# available class names:
+# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
+# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
+# - BertForQuestionAnswering
+# check https://huggingface.co/docs/transformers/model_doc/bert
+cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
+print(cls)
+# num_labels sets the size of the classification head
+model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-base-lastln-t2t', num_labels=2)
+```
 ## Model description
+GENA-LM (`gena-lm-bert-base-lastln-t2t`) is trained with a masked language modeling (MLM) objective, following the methods proposed in the BigBird paper, with 15% of tokens masked. The model config for `gena-lm-bert-base-lastln-t2t` is similar to bert-base:
 - 512 Maximum sequence length
 - 12 Layers, 12 Attention heads
 - 768 Hidden size
 - 32k Vocabulary size
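As a rough illustration of what the listed values imply, the sketch below derives the per-head dimension and the size of the token embedding matrix. The class name `ModelShape` is hypothetical, and `vocab_size=32_000` is an assumed reading of "32k"; the exact vocabulary size may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Values from the list above; vocab_size is an assumed reading of "32k"."""
    max_sequence_length: int = 512
    num_layers: int = 12
    num_attention_heads: int = 12
    hidden_size: int = 768
    vocab_size: int = 32_000  # assumption: the exact value may differ

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the hidden vector.
        return self.hidden_size // self.num_attention_heads

    @property
    def token_embedding_parameters(self) -> int:
        # One hidden_size-dimensional vector per vocabulary entry.
        return self.vocab_size * self.hidden_size


shape = ModelShape()
print(shape.head_dim)                    # 64
print(shape.token_embedding_parameters)  # 24576000
```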
+We pre-trained `gena-lm-bert-base-lastln-t2t` using the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling mutations from 1000-genome SNPs (gnomAD dataset). Pre-training was performed for 2,100,000 iterations with a batch size of 256 and a sequence length of 512 tokens. We modified the Transformer with [Pre-Layer normalization](https://arxiv.org/abs/2002.04745).
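The 15% token masking mentioned in this section can be illustrated with a simplified sketch. It only selects positions and replaces them with a mask token; the actual pre-training procedure may differ (for example, BERT-style recipes also leave some selected tokens unchanged or swap in random tokens). The names `mask_tokens` and `MASK_TOKEN` and the toy tokens are assumptions for illustration, not part of the GENA-LM code.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_fraction=0.15, seed=None):
    """Return (masked_tokens, masked_positions) with a fixed fraction masked."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so short inputs still produce a training signal.
    num_to_mask = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK_TOKEN
    return masked, positions


# Hypothetical token strings standing in for BPE tokens of a DNA sequence.
toy_tokens = [f"tok{i}" for i in range(20)]
masked, positions = mask_tokens(toy_tokens, seed=0)
print(len(positions))  # 3 of 20 tokens are masked (15%)
```

During training, the model would be asked to predict the original tokens at the masked positions only.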
 ## Evaluation
 For evaluation results, see our paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1