sayby committed on
Commit
7a49887
1 Parent(s): 2a44926

Create README.md

Files changed (1)
  1. README.md +86 -0
README.md ADDED
@@ -0,0 +1,86 @@
---
license: other
pipeline_tag: token-classification
tags:
- biology
- RNA
- Torsional
- Angles
---
# `RNA-TorsionBERT`

## Model Description

`RNA-TorsionBERT` is a 331 MB BERT-based language model that predicts RNA torsional and pseudo-torsional angles from the sequence.

`RNA-TorsionBERT` is a DNABERT model that was pre-trained on ~4200 RNA structures before being fine-tuned on 185 non-redundant structures.

On the Test Set (composed of RNA-Puzzles and CASP-RNA), it lowers the mean absolute error (MAE) by 6.2° on average across the nine angles compared with the previous state-of-the-art model, SPOT-RNA-1D. The table below reports the MAE in degrees for each angle.

| Model               | alpha | beta | gamma | delta | epsilon | zeta | chi  | eta  | theta |
|---------------------|-------|------|-------|-------|---------|------|------|------|-------|
| **RNA-TorsionBERT** | 37.3  | 19.6 | 29.4  | 13.6  | 16.6    | 26.6 | 14.7 | 20.1 | 25.4  |
| SPOT-RNA-1D         | 45.7  | 23.0 | 33.6  | 19.0  | 21.1    | 34.4 | 19.3 | 28.9 | 33.9  |

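As a quick sanity check, the 6.2° figure can be reproduced by averaging the per-angle MAE values in the table above. Treating the reported improvement as the difference of these two means is an inference from the numbers, not something the model card states explicitly.

```python
# MAE values (in degrees) copied from the table above, in the order
# alpha, beta, gamma, delta, epsilon, zeta, chi, eta, theta.
mae_torsionbert = [37.3, 19.6, 29.4, 13.6, 16.6, 26.6, 14.7, 20.1, 25.4]
mae_spot_rna_1d = [45.7, 23.0, 33.6, 19.0, 21.1, 34.4, 19.3, 28.9, 33.9]

# Assumption: the headline number is the difference between the mean MAEs.
mean_torsionbert = sum(mae_torsionbert) / len(mae_torsionbert)
mean_spot_rna_1d = sum(mae_spot_rna_1d) / len(mae_spot_rna_1d)
improvement = mean_spot_rna_1d - mean_torsionbert

print(f"RNA-TorsionBERT mean MAE: {mean_torsionbert:.1f}")
print(f"SPOT-RNA-1D mean MAE:     {mean_spot_rna_1d:.1f}")
print(f"Average improvement:      {improvement:.1f}")
```
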
**Key Features**
* Prediction of torsional and pseudo-torsional angles
* Supports sequences of up to 512 nucleotides

## Usage

Get started predicting torsional angles with `RNA-TorsionBERT` by using the following code snippet:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)
model = AutoModel.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)

# The input is written as space-separated 3-mers, with U replaced by T
sequence = "ACG CGG GGT GTT"
params_tokenizer = {
    "return_tensors": "pt",
    "padding": "max_length",
    "max_length": 512,
    "truncation": True,
}
inputs = tokenizer(sequence, **params_tokenizer)
output = model(inputs)["logits"]
```

- The model was fine-tuned from a DNABERT-3 model, so it uses the same tokenizer as DNABERT. Nucleotide `U` should therefore be replaced by `T` in the input sequence.
- The output is the sine and the cosine of each angle. The angles are in the following order: `alpha`, `beta`, `gamma`, `delta`, `epsilon`, `zeta`, `chi`, `eta`, `theta`.

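A minimal sketch of how a raw RNA sequence could be prepared for the tokenizer. The helper `to_kmer_input` is hypothetical and not part of the model repository. It assumes the tokenizer expects space-separated overlapping 3-mers, which is inferred from the example input `"ACG CGG GGT GTT"` and the DNABERT-3 base model rather than confirmed by the model card.

```python
def to_kmer_input(rna_sequence: str, k: int = 3) -> str:
    """Hypothetical helper: replace U by T and split into overlapping k-mers."""
    dna_like = rna_sequence.upper().replace("U", "T")
    if len(dna_like) < k:
        raise ValueError(f"Sequence must contain at least {k} nucleotides")
    # Slide a window of size k over the sequence, one nucleotide at a time
    kmers = [dna_like[i:i + k] for i in range(len(dna_like) - k + 1)]
    return " ".join(kmers)


print(to_kmer_input("ACGGUU"))  # ACG CGG GGT GTT
```

Under that assumption, the returned string can be passed to the tokenizer in place of a hand-written `sequence`.
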
To convert the predictions into angles, you can use the following code snippet:

```python
from typing import Optional

import numpy as np

ANGLES_ORDER = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "chi", "eta", "theta"]

def convert_sin_cos_to_angles(output: np.ndarray, input_ids: Optional[np.ndarray] = None) -> np.ndarray:
    """
    Convert the raw predictions of RNA-TorsionBERT into angles.
    Each angle is recovered from its sine and cosine using:
        alpha = arctan2(sin(alpha), cos(alpha))
    :param output: np.ndarray with the predicted sine and cosine values for each angle
    :param input_ids: the input_ids of RNA-TorsionBERT. Used to keep only the positions of the sequence,
        and not the special tokens.
    :return: a np.ndarray with the angles (in degrees) for the sequence
    """
    if input_ids is not None:
        # Token ids 0 to 4 are treated as special tokens; their predictions are set to NaN in place
        output[np.isin(input_ids, [0, 1, 2, 3, 4])] = np.nan
    pair_indexes = np.arange(0, output.shape[-1], 2)
    impair_indexes = np.arange(1, output.shape[-1], 2)
    # Odd channels are read as sines, even channels as cosines
    sin, cos = output[:, :, impair_indexes], output[:, :, pair_indexes]
    radians = np.arctan2(sin, cos)
    angles = np.degrees(radians)
    return angles

output = output.cpu().detach().numpy()
input_ids = inputs["input_ids"].cpu().detach().numpy()
real_angles = convert_sin_cos_to_angles(output, input_ids)
```
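
The conversion can be checked without downloading the model by running it on synthetic values. The sketch below builds a fake output array from made-up angles, recovers them with `arctan2`, and labels the result by angle name. The channel layout (cosine on even indices, sine on odd indices) mirrors the indexing used in the snippet above and is not independently confirmed. The comparison with a plain `arctan` of the ratio illustrates why `arctan2` is the safer choice: it keeps the quadrant information.

```python
import numpy as np

ANGLES_ORDER = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "chi", "eta", "theta"]

# Made-up angles in degrees: batch of 1 sequence with 2 positions and 9 angles each
true_angles = np.array([[
    [-150.0, 170.0, 60.0, 80.0, -140.0, -70.0, -160.0, 165.0, -120.0],
    [-60.0, 175.0, 55.0, 85.0, -150.0, -65.0, -155.0, 170.0, -130.0],
]])

# Build a fake model output of shape (1, 2, 18): cos on even channels, sin on odd channels
radians = np.radians(true_angles)
fake_output = np.empty(true_angles.shape[:-1] + (2 * len(ANGLES_ORDER),))
fake_output[..., 0::2] = np.cos(radians)
fake_output[..., 1::2] = np.sin(radians)

sin, cos = fake_output[..., 1::2], fake_output[..., 0::2]
recovered = np.degrees(np.arctan2(sin, cos))  # quadrant-aware, range (-180, 180]
naive = np.degrees(np.arctan(sin / cos))      # loses the quadrant, range (-90, 90)

# Label the angles of the first sequence by name
per_angle = {name: recovered[0, :, i] for i, name in enumerate(ANGLES_ORDER)}
print("alpha with arctan2:", per_angle["alpha"])
print("alpha with arctan: ", naive[0, :, 0])
```

With these values, `arctan2` returns the original angles, while the plain `arctan` maps the first alpha value of -150° to 30°.
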