PepDoRA: A Unified Peptide-Specific Language Model via Weight-Decomposed Low-Rank Adaptation
In this work, we introduce PepDoRA, a SMILES transformer obtained by fine-tuning the state-of-the-art ChemBERTa-77M-MLM transformer on modified peptide SMILES with DoRA (Weight-Decomposed Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) method that incorporates weight decomposition. The resulting representations can be used for numerous downstream tasks, including membrane permeability prediction and target binding assessment, for both unmodified and modified peptide sequences.
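For readers unfamiliar with DoRA, the weight decomposition mentioned above can be written out as below. This is a sketch of the general DoRA formulation, not a description of PepDoRA's specific training configuration; the symbols $W_0$, $m$, $B$, $A$, $d$, $k$, and $r$ are notation introduced here for illustration.

```latex
W' = m \cdot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c},
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Here $W_0 \in \mathbb{R}^{d \times k}$ is a frozen pretrained weight matrix, $\lVert \cdot \rVert_c$ is the column-wise vector norm, and $m \in \mathbb{R}^{1 \times k}$ is a trainable magnitude vector initialized to $\lVert W_0 \rVert_c$. Only $m$, $B$, and $A$ are updated during fine-tuning: the low-rank product $BA$ adjusts the direction of each column, while $m$ rescales its magnitude.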
Here's how to extract PepDoRA embeddings for your input peptide:
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Option 1: load the base model, then attach the PepDoRA adapter with PEFT.
# This wrapper returns language-modeling outputs, so read embeddings from
# outputs.hidden_states[-1] rather than outputs.last_hidden_state.
base_model = "DeepChem/ChemBERTa-77M-MLM"
adapter_model = "ChatterjeeLab/PepDoRA"
model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Option 2: load the model and the tokenizer directly using AutoModel.
# Running this after Option 1 replaces its model and tokenizer; the
# embedding code that follows uses the objects loaded here.
model_name = "ChatterjeeLab/PepDoRA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
peptide = "CC(C)C[C@H]1NC(=O)[C@@H](C)NCCCCCCNC(=O)[C@H](CO)NC1=O"
# Tokenize the peptide
inputs = tokenizer(peptide, return_tensors="pt")
# Get the hidden states (embeddings) from the model
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# Extract the embeddings from the last hidden layer
embedding = outputs.last_hidden_state
# Print the embedding shape: (batch_size, sequence_length, hidden_size)
print(embedding.shape)
# Print the embedding itself
print(embedding)
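The embedding extracted above has one vector per token, while downstream predictors usually need one fixed-size vector per peptide. A minimal sketch of one common way to get that, mean pooling over non-padding tokens, is shown below. The pooling strategy and the helper name `mean_pool` are assumptions made for illustration, not something the PepDoRA authors prescribe, and the toy tensors stand in for `outputs.last_hidden_state` and `inputs["attention_mask"]`.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-padding) tokens.

    Hypothetical helper: mean pooling is an assumed choice, not the authors' method.
    """
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    # Sum only the embeddings of real tokens
    summed = (last_hidden_state * mask).sum(dim=1)
    # Count real tokens per sequence; clamp to avoid division by zero
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Toy stand-ins: one sequence, three tokens, hidden size 2; the last token is padding
toy_hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 0]])

pooled = mean_pool(toy_hidden, toy_mask)
print(pooled.shape)  # torch.Size([1, 2])
print(pooled)        # tensor([[2., 3.]])
```

With the model loaded as above, `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` should yield one vector per input peptide. Masking matters mainly when several peptides are tokenized together with padding, since padded positions would otherwise skew the average.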
Repository Authors
Leyao Wang, Undergraduate Intern in the Chatterjee Lab
Pranam Chatterjee, Assistant Professor at Duke University
Reach out to us with any questions!