Leveraging Large Language Models for Metagenomic Analysis
Model Overview: The model presented in this paper builds on the RoBERTa transformer and uses a similar approach to optimize and select the best BigBird architecture for long gene sequences. It is trained specifically on gene sequences. The model aims to uncover insights within metagenomic data and is evaluated on tasks such as classification and sequence embedding.
Model Architecture:
- Base Model: BigBird transformer architecture
- Tokenizer: Custom K-mer Tokenizer with k-mer length of 6 and overlapping tokens
- Training: Trained on a diverse dataset of 497 genes from 2000 bacterial and archaeal genomes
- Embeddings: Generates sequence embeddings using both mean and max pooling of hidden states
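To illustrate what overlapping 6-mer tokenization means, here is a minimal sketch in plain Python. The function name `kmer_split` is hypothetical and is not part of the KmerTokenizer package; the real tokenizer also maps k-mers to integer ids, and how it handles trailing bases or padding is not covered here.

```python
def kmer_split(seq, k=6, overlapping=True):
    """Split a nucleotide string into k-mers (illustrative only).

    Overlapping mode slides the window one base at a time;
    non-overlapping mode jumps k bases and, in this sketch,
    drops any trailing bases that do not fill a whole k-mer.
    """
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


overlapping_kmers = kmer_split("ATCGATGC", k=6, overlapping=True)
disjoint_kmers = kmer_split("ATCGATGC", k=6, overlapping=False)
print(overlapping_kmers)  # ['ATCGAT', 'TCGATG', 'CGATGC']
print(disjoint_kmers)     # ['ATCGAT']
```

With overlapping windows, a sequence of length L yields L - k + 1 tokens, so the token sequence is nearly as long as the nucleotide sequence itself.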
Dataset: Details of the dataset will be shared in the supplementary materials of the paper. The dataset includes a comprehensive collection of gene sequences from various metagenomic sources.
Steps to Use the Model:
Install KmerTokenizer:
pip install git+https://github.com/MsAlEhR/KmerTokenizer.git
Example Code:
from KmerTokenizer import KmerTokenizer
from transformers import AutoModel
import torch

# Example gene sequence
seq_list = ["ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"]

# Initialize the tokenizer
tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=4096)
tokenized_output = tokenizer.kmer_tokenize(seq_list)

# Convert tokenized output to tensor
inputs = torch.tensor(tokenized_output)

# Load the pre-trained BigBird model
model = AutoModel.from_pretrained("MsAlEhR/MetaBERTa-bigbird-gene", output_hidden_states=True)

# Generate hidden states; index 0 of the output is the last layer,
# with shape (batch, seq_len, hidden_dim)
hidden_states = model(inputs)[0]

# Compute mean and max pooling over the sequence dimension (dim=1),
# giving one embedding vector per input sequence
embedding_mean = torch.mean(hidden_states, dim=1)
# torch.max with a dim argument returns (values, indices); keep the values
embedding_max = torch.max(hidden_states, dim=1).values
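If the tokenizer pads every sequence to `maxlen` (an assumption; the padding behavior is not described here), pooling over all positions would mix padding vectors into the embedding. The sketch below shows mask-aware mean and max pooling on plain Python lists so that it runs without any dependencies. The name `masked_pool` and the toy values are hypothetical; with real model outputs the same reduction would be done on tensors along the sequence dimension.

```python
def masked_pool(hidden, mask):
    """Mean- and max-pool token vectors, skipping masked-out positions.

    hidden: list of token vectors, shape (seq_len, dim)
    mask:   list of 0/1 flags, 1 for a real token and 0 for padding
    """
    kept = [vec for vec, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("mask excludes every token")
    dim = len(kept[0])
    mean_vec = [sum(v[d] for v in kept) / len(kept) for d in range(dim)]
    max_vec = [max(v[d] for v in kept) for d in range(dim)]
    return mean_vec, max_vec


# Toy example: three token positions, two hidden dimensions;
# the last position is treated as padding and ignored.
hidden = [[1.0, 4.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
mean_vec, max_vec = masked_pool(hidden, mask)
print(mean_vec)  # [2.0, 3.0]
print(max_vec)   # [3.0, 4.0]
```

Whether masking is needed depends on how the model was trained and evaluated, so embeddings intended to match the paper's results should follow the pooling used in the example code.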
Citation: For a detailed overview of leveraging large language models for metagenomic analysis, refer to our paper:
Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (Year). Leveraging Large Language Models for Metagenomic Analysis. IEEE.