
ProteinPGLM-10B-MLM

Model Introduction

ProteinPGLM-10B-MLM is the open-source version of the latest masked protein language models designed for protein understanding tasks. The ProteinPGLM family of models is developed by Tsinghua University. Alongside this release, we have made available the int4-quantized ProteinPGLM-100B weights and a set of smaller models: 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.

Out-of-Distribution Perplexity Evaluation

We evaluated the ProteinPGLM-MLM (PGLM) and ProteinPGLM-INT4 (100B) models on two out-of-distribution (OOD) test sets: one whose sequences share less than 0.9 sequence identity with the training set (< 0.9 ID) and one whose sequences share less than 0.5 identity (< 0.5 ID). Each OOD set comprises approximately 10,000 protein sequences. The perplexity results, compared against ESM2-3B and ESM2-15B, are as follows (lower is better):

| Model    | ESM2 (3B) | ESM2 (15B) | PGLM (1B) | PGLM (3B) | PGLM (10B) | PGLM-INT4 (100B) |
|----------|-----------|------------|-----------|-----------|------------|------------------|
| < 0.9 ID | 7.7       | 7.3        | 9.3       | 7.8       | 7.6        | 6.8              |
| < 0.5 ID | 11.5      | 11.0       | 13.5      | 11.9      | 11.6       | 10.8             |
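
For reference, masked-LM perplexity of this kind is typically computed in a leave-one-out fashion: each position is masked in turn, the model scores the true residue at that position, and the averaged negative log-likelihood is exponentiated. Below is a minimal, illustrative sketch of that computation for a single sequence, assuming model and tokenizer are loaded as in the How to use section below, that the tokenizer defines a mask token, and that the forward pass returns standard .logits; the exact protocol behind the reported numbers (masking scheme, batching, special-token handling) may differ.

import torch

def pseudo_perplexity(seq, model, tokenizer):
    # Leave-one-out masked-LM perplexity for a single protein sequence.
    device = next(model.parameters()).device
    enc = tokenizer(seq, add_special_tokens=True, return_tensors="pt")
    input_ids = enc["input_ids"].to(device)
    attention_mask = enc["attention_mask"].to(device)
    nlls = []
    with torch.inference_mode():
        for i in range(input_ids.shape[1]):
            true_id = input_ids[0, i].item()
            if true_id in tokenizer.all_special_ids:
                continue  # skip special tokens such as <eos>
            masked = input_ids.clone()
            masked[0, i] = tokenizer.mask_token_id  # assumes the tokenizer defines a mask token
            logits = model(input_ids=masked, attention_mask=attention_mask).logits
            log_probs = torch.log_softmax(logits[0, i].float(), dim=-1)
            nlls.append(-log_probs[true_id])
    return torch.exp(torch.stack(nlls).mean()).item()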

Downstream Protein Understanding Tasks Evaluation

(TODO)

How to use


# Obtain residue embeddings
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-10b-mlm", trust_remote_code=True, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained("Bo1015/proteinglm-10b-mlm", trust_remote_code=True, torch_dtype=torch.bfloat16)
if torch.cuda.is_available():
    model = model.cuda()
model.eval()

# Run the inputs on the same device as the model (falls back to CPU if CUDA is unavailable)
device = next(model.parameters()).device

seq = 'MILMCQHFSGQFSKYFLAVSSDFCHFVFPIILVSHVNFKQMKRKGFALWNDRAVPFTQGIFTTVMILLQYLHGTG'
output = tokenizer(seq, add_special_tokens=True, return_tensors='pt')
with torch.inference_mode():
    inputs = {"input_ids": output["input_ids"].to(device), "attention_mask": output["attention_mask"].to(device)}
    # One embedding per residue; the final position corresponds to the <eos> token and is dropped
    output_embeddings = model(**inputs, output_hidden_states=True, return_last_hidden_state=True).hidden_states[:-1, 0]
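
If a single vector per protein is needed (for example, as input to a lightweight downstream classifier), one simple option is to mean-pool the residue embeddings obtained above. A minimal sketch, assuming output_embeddings is a 2-D tensor with one row per residue; adjust the pooling dimension if the custom model returns a different layout:

# Mean-pool residue embeddings into a single per-protein vector
protein_embedding = output_embeddings.float().mean(dim=0)  # expected shape: [hidden_size]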


# Model for sequence-level tasks (per-protein predictions)
model = AutoModelForSequenceClassification.from_pretrained("Bo1015/proteinglm-10b-mlm", trust_remote_code=True, torch_dtype=torch.bfloat16)

# Model for token-level tasks (per-residue predictions)
model = AutoModelForTokenClassification.from_pretrained("Bo1015/proteinglm-10b-mlm", trust_remote_code=True, torch_dtype=torch.bfloat16)
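
The classification heads above are freshly initialized and intended to be fine-tuned on labeled data for your task. As a rough illustration of a forward pass through the sequence-level head, assuming the remote-code model accepts the standard num_labels argument via from_pretrained and returns an output with .logits (both are assumptions about this repository's API, not guarantees):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-10b-mlm", trust_remote_code=True, use_fast=True)
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "Bo1015/proteinglm-10b-mlm",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    num_labels=2,  # illustrative binary task; set to your own label count
)
if torch.cuda.is_available():
    clf_model = clf_model.cuda()
clf_model.eval()

seq = 'MILMCQHFSGQFSKYFLAVSSDFCHFVFPIILVSHVNFKQMKRKGFALWNDRAVPFTQGIFTTVMILLQYLHGTG'
enc = tokenizer(seq, add_special_tokens=True, return_tensors='pt')
enc = {k: v.to(next(clf_model.parameters()).device) for k, v in enc.items()}
with torch.inference_mode():
    logits = clf_model(**enc).logits  # shape: [1, num_labels]; the untrained head must be fine-tuned before use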

LICENSE

The model in this repository is open-sourced under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

Citations

If you find our work useful, please consider citing the following papers:

@misc{chen2024xtrimopglm,
  title={xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein},
  author={Chen, Bo and Cheng, Xingyi and Li, Pan and Geng, Yangli-ao and Gong, Jing and Li, Shen and Bei, Zhilei and Tan, Xu and Wang, Boyan and Zeng, Xin and others},
  year={2024},
  eprint={2401.06199},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  note={arXiv preprint arXiv:2401.06199}
}

@misc{cheng2024training,
  title={Training Compute-Optimal Protein Language Models},
  author={Cheng, Xingyi and Chen, Bo and Li, Pan and Gong, Jing and Tang, Jie and Song, Le},
  year={2024},
  note={bioRxiv preprint, Cold Spring Harbor Laboratory}
}