---
license: mit
---

A [BarcodeBERT](https://arxiv.org/pdf/2311.02401) model trained on all complete DNA sequences from the latest [BOLD database release](http://www.boldsystems.org/index.php/datapackages/Latest). We used the `nucraw` column of DNA sequences and followed the preprocessing steps outlined in the BarcodeBERT paper.

The model was trained for a total of 17 epochs.
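
For reference, the usage example below expects the sequence to already be split into space-separated 4-mers. A minimal sketch of that formatting step is shown here; the helper name `format_barcode` is illustrative, and the handling of ambiguous (non-A/C/G/T) characters is an assumption, not the exact preprocessing pipeline from the BarcodeBERT paper.

```py
def format_barcode(raw_seq: str, k: int = 4, max_len: int = 660) -> str:
    # Uppercase, truncate to the 660-base maximum, and split into
    # non-overlapping 4-mers separated by spaces. Characters outside
    # A/C/G/T are left in place, matching the example sequence below.
    seq = raw_seq.strip().upper()[:max_len]
    return " ".join(seq[i:i + k] for i in range(0, len(seq), k))

print(format_barcode("AACAATGTATTTATTT"))  # -> "AACA ATGT ATTT ATTT"
```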

## Example Usage

```py
import torch
from transformers import PreTrainedTokenizerFast, BertForMaskedLM

model = BertForMaskedLM.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")
model.eval()

tokenizer = PreTrainedTokenizerFast.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")

# The DNA sequence you want to embed.
# There should be a space after every 4 characters.
# The sequence may also contain unknown characters which are not A, C, T, G.
# The maximum DNA sequence length (not counting spaces) should be 660 characters.
dna_sequence = "AACA ATGT ATTT A-T- TTCG CCCT TGTG AATT TATT ..."

inputs = tokenizer(dna_sequence, return_tensors="pt")

# Obtain a DNA embedding, which is a vector of length 768.
# The embedding is a representation of this DNA sequence in the model's latent space.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
embedding = outputs.hidden_states[-1].mean(1).squeeze()
```
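
As a possible follow-up, reusing the `model` and `tokenizer` loaded above, embeddings for several sequences can be computed and compared, for example with cosine similarity, to gauge how close two barcodes sit in the model's latent space. The sequences below are illustrative placeholders, not real barcodes.

```py
import torch
import torch.nn.functional as F

# Illustrative placeholder sequences; use real, preprocessed barcodes in practice.
sequences = [
    "AACA ATGT ATTT ATTT TCGC CCTT",
    "AACA ATGT ATTT ATTC TCGC CCTT",
]

embeddings = []
with torch.no_grad():
    for seq in sequences:
        inputs = tokenizer(seq, return_tensors="pt")
        outputs = model(**inputs, output_hidden_states=True)
        # Mean-pool the last hidden layer into one 768-dimensional vector.
        embeddings.append(outputs.hidden_states[-1].mean(1).squeeze())

# Cosine similarity between the two sequence embeddings.
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```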

## Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65ec809e794d34d1a4379f1f/LpXuOJn7CXR_UnA8sFmK1.png) |