This model is instruction-tuned on top of lamm-mit/BioinspiredLlama-3-1-8B-128k
to predict the dominant secondary structure of a protein from its amino acid sequence.
Sample instruction:
Dominant secondary structure of < N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E >
Response:
AH
Raw format of training data (in Llama 3.1 chat template format):
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nDominant secondary structure of < V V F D V V F D V V F D V V F D V V F D V V F D V V F D V V F D ><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nUNSTRUCTURED<|eot_id|>
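The raw format above can be reproduced from a plain sequence and a label with simple string formatting. The sketch below is illustrative only: the helper names `space_sequence`, `build_instruction`, and `format_training_example` are hypothetical and not part of the model repository, and the template string is copied from the example above rather than produced by the tokenizer.

```python
# Hypothetical helpers (not part of the model repository) that rebuild
# the raw training format shown above from a sequence and a label.

def space_sequence(sequence: str) -> str:
    """Insert single spaces between residues, e.g. 'VVFD' -> 'V V F D'."""
    return " ".join(sequence.replace(" ", "").upper())


def build_instruction(sequence: str) -> str:
    """Wrap a sequence in the instruction wording used in this card."""
    return f"Dominant secondary structure of < {space_sequence(sequence)} >"


def format_training_example(sequence: str, label: str) -> str:
    """Assemble one user/assistant turn in the Llama 3.1 layout shown above."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{build_instruction(sequence)}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{label}<|eot_id|>"
    )


example = format_training_example("VVFD" * 8, "UNSTRUCTURED")
print(example)
```

At inference time, `tokenizer.apply_chat_template` handles this formatting, so a helper like this would mainly be useful for inspecting or preparing training data.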
[Figure: visual representation of what the model predicts]
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer in bfloat16
model = AutoModelForCausalLM.from_pretrained(
    'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure',
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure',
)
Load with 4-bit quantization:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base_model_name = "lamm-mit/BioinspiredLlama-3-1-8B-128k"
new_model = 'lamm-mit/BioinspiredLlama-3-1-8B-128k-dominant-protein-SS-structure'

# NF4 quantization with double quantization and bfloat16 compute
bnb_config4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized base model first
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config4bit,
    torch_dtype=torch.bfloat16,
)

# Then attach the fine-tuned weights on top of the base model
model = PeftModel.from_pretrained(model, new_model)
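As a rough illustration of why 4-bit loading can help on smaller GPUs, the weight storage can be estimated from the parameter count. The sketch below assumes roughly 8 billion parameters (inferred from the "8B" in the model name) and counts weights only. Quantization constants, adapter weights, activations, and the KV cache are ignored, so actual memory use will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: about 8 billion parameters, per the "8B" in the name

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # bfloat16 uses 16 bits per weight
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantization uses 4 bits per weight

print(f"bfloat16 weights: ~{bf16_gb:.0f} GB")
print(f"4-bit weights:    ~{nf4_gb:.0f} GB")
```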
Inference function for convenience:
def generate_response(
    text_input="What is spider silk?",
    system_prompt='You are a biological materials scientist.',
    num_return_sequences=1,
    temperature=1.0,  # the higher the temperature, the more creative the model becomes
    max_new_tokens=127,
    device='cuda',
    num_beams=1,
    eos_token_id=[128001, 128008, 128009],
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
    messages=None,
):
    # Start a new conversation, or append the new user turn to an existing one
    if not messages:
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text_input},
        ]
    else:
        messages.append({"role": "user", "content": text_input})

    # Render the conversation with the model's chat template
    text_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer([text_input], add_special_tokens=True, return_tensors='pt').to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            num_beams=num_beams,
            top_k=top_k,
            eos_token_id=eos_token_id,
            top_p=top_p,
            num_return_sequences=num_return_sequences,
            do_sample=True,
            repetition_penalty=repetition_penalty,
        )

    # Drop the prompt tokens so that only the newly generated text is decoded
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True), messages
Usage:
AA_sequence = 'N E R R I L E Q K K H Y F W L L L Q R T Y T K T G K P K P S T W D L A S K E L G E S L E Y K A L G D E D N I R R Q I F E D F K P E'
answer, _ = generate_response(
    text_input='Dominant secondary structure of < ' + AA_sequence + ' >',
    max_new_tokens=16,
    temperature=0.1,
)
print(f"Prediction: {answer[0]}")
The output is:
Prediction: AH
[Figure: visualization of the protein, for comparison with the prediction]
As predicted, this protein (PDB ID 6N7P, https://www.rcsb.org/structure/6N7P) is primarily alpha-helical.
This model was trained with QLoRA on sequences shorter than 128 amino acids. Training set and split:
from datasets import load_dataset

max_seq_length = 128

# Dataset: keep only sequences shorter than max_seq_length
dataset = load_dataset('lamm-mit/protein_secondary_structure_from_PDB')['train']
dataset = dataset.filter(lambda example: example['Sequence_length'] < max_seq_length)

# Rename columns
dataset = dataset.rename_column('Sequence_spaced', 'question')
dataset = dataset.rename_column('Primary_SS_Type', 'answer')

# Hold out 10% of the examples as a test set
dataset = dataset.train_test_split(test_size=0.1, seed=42)
...
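Because the training filter above keeps only sequences with `Sequence_length < max_seq_length`, it may be worth checking the input length before querying the model. The following is a minimal sketch: `residue_count` and `is_within_training_length` are hypothetical helpers, and the sketch assumes that sequence length means the number of residues in a space-separated, one-letter-code sequence.

```python
MAX_SEQ_LENGTH = 128  # same threshold as max_seq_length in the training filter


def residue_count(spaced_sequence: str) -> int:
    """Count residues in a space-separated sequence such as 'V V F D'."""
    return len(spaced_sequence.split())


def is_within_training_length(spaced_sequence: str,
                              max_seq_length: int = MAX_SEQ_LENGTH) -> bool:
    """Return True if the sequence is shorter than the training cutoff."""
    return residue_count(spaced_sequence) < max_seq_length


query = "V V F D V V F D"
if is_within_training_length(query):
    print(f"{residue_count(query)} residues: within the training length range")
else:
    print(f"{residue_count(query)} residues: longer than the training sequences")
```

Predictions for longer sequences are not necessarily wrong, but they fall outside the length range the model saw during fine-tuning.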
Test accuracy: 77.23%
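The card does not state how the test accuracy was computed. One plausible reading, shown here purely as an assumption, is exact-match accuracy between the generated label and the reference `answer` column after trimming whitespace. The function name and the toy labels below are illustrative and are not taken from the actual evaluation.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str],
                         references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Toy data using the two labels that appear in this card
preds = ["AH", "UNSTRUCTURED", " AH\n", "AH"]
refs = ["AH", "UNSTRUCTURED", "AH", "UNSTRUCTURED"]
print(f"Accuracy: {exact_match_accuracy(preds, refs):.2%}")
```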
@article{Buehler_2024,
title={Fine-tuned LLMs for protein and other molecular feature predictions},
author={Markus J. Buehler},
journal={},
year={2024}
}