---
license: mit
---

BarcodeBERT model trained on all complete DNA sequences from the latest BOLD database release. We used the `nucraw` column of DNA sequences and followed the preprocessing steps outlined in the BarcodeBERT paper.

The model was trained for a total of 17 epochs.

## Example Usage

```python
import torch
from transformers import PreTrainedTokenizerFast, BertForMaskedLM

model = BertForMaskedLM.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")
model.eval()

tokenizer = PreTrainedTokenizerFast.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")

# The DNA sequence you want to embed.
# Insert a space after every 4 characters.
# The sequence may contain unknown characters that are not A, C, T, or G.
# The maximum DNA sequence length (not counting spaces) is 660 characters.
dna_sequence = "AACA ATGT ATTT A-T- TTCG CCCT TGTG AATT TATT ..."

inputs = tokenizer(dna_sequence, return_tensors="pt")

# Obtain a DNA embedding, a vector of length 768 that represents this
# sequence in the model's latent space. Hidden states are only returned
# when output_hidden_states=True is passed to the forward call.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
embedding = outputs.hidden_states[-1].mean(1).squeeze()
```
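The tokenizer expects the raw sequence to be truncated to 660 bases and split into 4-character chunks separated by spaces. A minimal sketch of that formatting step is below; the `format_dna` helper name is our own and is not part of the released code.

```python
def format_dna(seq: str, k: int = 4, max_len: int = 660) -> str:
    """Truncate a raw DNA sequence to max_len bases and insert a space
    after every k characters, matching the expected input format."""
    seq = seq[:max_len]
    return " ".join(seq[i:i + k] for i in range(0, len(seq), k))

print(format_dna("AACAATGTATTTACTG"))  # → "AACA ATGT ATTT ACTG"
```

The output of `format_dna` can be passed directly to the tokenizer in place of the hand-formatted `dna_sequence` string above.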

## Results

*(Results figure omitted from this text rendering.)*