|
--- |
|
tags: |
|
- biology |
|
- DNA |
|
- genomics |
|
--- |
|
This is the official pre-trained model introduced in [GROVER: A foundation DNA language with optimized vocabulary learns sequence context in the human genome](https://www.biorxiv.org/content/10.1101/2023.07.19.549677v2).
|
|
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the tokenizer and the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
model = AutoModelForMaskedLM.from_pretrained("PoetschLab/GROVER")
```
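
As a quick check that the model loads correctly, here is a minimal masked-prediction sketch. The DNA sequence is illustrative, and the snippet assumes the tokenizer exposes a standard mask token, as is typical for masked language models:

```python
# Minimal sketch: mask one token in an illustrative DNA sequence and predict it.
sequence = "ATGCGTACGTTAGCCTAGGCTA" * 5  # placeholder sequence, not from the paper

inputs = tokenizer(sequence, return_tensors="pt")

# Mask a token roughly in the middle of the tokenized sequence
mask_position = inputs["input_ids"].shape[1] // 2
inputs["input_ids"][0, mask_position] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

# Most likely token at the masked position
predicted_id = logits[0, mask_position].argmax(dim=-1).item()
print(tokenizer.convert_ids_to_tokens(predicted_id))
```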
|
|
|
|
|
Preliminary analysis shows that re-tokenizing a sequence with Byte Pair Encoding (BPE) changes the resulting tokens significantly when the sequence is shorter than 50 nucleotides. For sequences longer than 50 nucleotides, tokens near the sequence edges can still differ, so handle them with care.
|
We advise adding 100 nucleotides at the beginning and end of every sequence to guarantee that your sequence is represented with the same tokens as in the original tokenization.
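
One way to follow that advice, sketched below under a few assumptions: the region and flanking sequences are placeholders for your own genomic context, and extracting the region's tokens via `return_offsets_mapping` requires the fast (Rust-backed) tokenizer, which `AutoTokenizer` normally returns when one is available:

```python
# Sketch: tokenize a region together with 100 nt of genomic context on each
# side so that BPE merges at the edges match the original tokenization,
# then keep only the tokens overlapping the region itself.
region = "ACGTACGTACGTACGTACGT"   # placeholder: your sequence of interest
left_flank = "A" * 100            # placeholder: the actual upstream 100 nt
right_flank = "A" * 100           # placeholder: the actual downstream 100 nt

padded = left_flank + region + right_flank
enc = tokenizer(padded, return_offsets_mapping=True, add_special_tokens=False)

# A token belongs to the region if its character span overlaps the region
start, end = len(left_flank), len(left_flank) + len(region)
region_tokens = [
    tok
    for tok, (tok_start, tok_end) in zip(
        tokenizer.convert_ids_to_tokens(enc["input_ids"]), enc["offset_mapping"]
    )
    if tok_start < end and tok_end > start
]
print(region_tokens)
```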
|
We also provide the tokenized chromosomes together with their respective nucleotide mappers (available in the folder tokenized chromosomes).
|
|