vshulev's picture
Update README.md
18d7275 verified
metadata
license: mit

BarcodeBERT model trained on all complete DNA sequences from the latest BOLD database release. We used the 'nucraw' column of DNA sequences and followed the preprocessing steps outlined in the BarcodeBERT paper.

The model has been trained for a total of 17 epochs.

Example Usage

from transformers import PreTrainedTokenizerFast, BertForMaskedLM

model = BertForMaskedLM.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")
model.eval()

tokenizer = PreTrainedTokenizerFast.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")

# The DNA sequence you want to predict.
# There should be a space after every 4 characters.
# The sequence may also have unknown characters which are not A,C,T,G.
# The maximum DNA sequence length (not counting spaces) should be 660 characters
dna_sequence = "AACA ATGT ATTT A-T- TTCG CCCT TGTG AATT TATT ..."

inputs = tokenizer(dna_sequence, return_tensors="pt")

# Obtain a DNA embedding, which is a vector of length 768.
# The embedding is a representation of this DNA sequence in the model's latent space.
embedding = model(**inputs).hidden_states[-1].mean(1).squeeze()

Results

image/png