Update README.md
README.md

---
license: cc-by-nc-sa-4.0
datasets:
- InstaDeepAI/plant-genomic-benchmark
---

## Model Overview
AgroNT is a DNA language model trained primarily on edible plant genomes. More specifically, AgroNT uses the transformer architecture with self-attention and a masked language modeling objective to leverage widely available genotype data from 48 different plant species to learn general representations of nucleotide sequences. AgroNT contains 1 billion parameters and has a context window of 1024 tokens.

AgroNT uses a non-overlapping 6-mer tokenizer to convert genomic nucleotide sequences into tokens. As a result, the 1024 tokens correspond to approximately 6144 base pairs.
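
For a quick start, the snippet below sketches how the model could be loaded and queried with the 🤗 Transformers `AutoTokenizer`/`AutoModelForMaskedLM` classes. The checkpoint identifier and the example sequence are placeholders for illustration; check the model page on the Hub for the published checkpoint name.

```python
# Minimal usage sketch (checkpoint name is assumed -- verify it on the Hugging Face Hub).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "InstaDeepAI/agro-nucleotide-transformer-1b"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

sequence = "ATGCGTACGTAGCTAGCTAGCTAGGATCCA"  # toy nucleotide sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

embeddings = outputs.hidden_states[-1]  # per-token representations
print(embeddings.shape)                 # (1, num_tokens, hidden_size)
```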

Our pre-training dataset was built from the reference genomes of (mostly) edible plants. The dataset consists of approximately 10.5 million genomic sequences across 48 different species.

#### Processing
All reference genomes for each species were assembled into a single FASTA file. In this FASTA file, all nucleotides other than A, T, C, and G were replaced by N. A tokenizer was used to convert strings of letters into sequences of tokens.
The tokenizer's alphabet consisted of the 4<sup>6</sup> = 4096 possible 6-mer combinations obtained by combining A, T, C, and G, as well as five additional tokens representing standalone A, T, C, G, and N. It also included three special tokens: the pad [PAD], mask [MASK], and class [CLS] tokens. This resulted in a vocabulary of 4104 tokens. To tokenize an input sequence, the tokenizer started with a class token and then converted the sequence from left to right, matching 6-mer tokens when possible, or using the standalone tokens when necessary (for instance, when the letter N was present or if the sequence length was not a multiple of 6).
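
To make the scheme above concrete, here is a minimal sketch of such a preprocessing and tokenization step, assuming a simple greedy left-to-right matcher. It reproduces the 4104-token vocabulary count and the standalone-token fallback described above, but it is not the actual AgroNT tokenizer implementation and may differ in details (for example, how tokenization resumes after an N).

```python
import re
from itertools import product

# Illustrative sketch of the preprocessing and tokenization described above;
# NOT the actual AgroNT tokenizer implementation.
SPECIAL_TOKENS = ["[PAD]", "[MASK]", "[CLS]"]
KMERS = ["".join(p) for p in product("ATCG", repeat=6)]  # 4^6 = 4096 6-mers
STANDALONE = list("ATCGN")                                # 5 single-nucleotide tokens
VOCAB = SPECIAL_TOKENS + KMERS + STANDALONE               # 3 + 4096 + 5 = 4104 tokens
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}

def preprocess(raw: str) -> str:
    """Replace every nucleotide other than A, T, C, G with N."""
    return re.sub(r"[^ATCG]", "N", raw.upper())

def tokenize(sequence: str) -> list[str]:
    """Greedy left-to-right 6-mer tokenization with single-nucleotide fallback."""
    tokens = ["[CLS]"]  # every tokenized sequence starts with the class token
    i = 0
    while i < len(sequence):
        chunk = sequence[i : i + 6]
        if len(chunk) == 6 and chunk in TOKEN_TO_ID:  # a clean 6-mer of A/T/C/G
            tokens.append(chunk)
            i += 6
        else:  # an N in the window, or fewer than 6 bases remaining
            tokens.append(sequence[i])
            i += 1
    return tokens

print(tokenize(preprocess("atcgatYat")))
# ['[CLS]', 'ATCGAT', 'N', 'A', 'T']
```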
61 |
|
|
|
68 |
and an effective batch size of 1.5M tokens for 315k update steps, resulting in the model training on a total of 472.5B tokens.
|
69 |
|
70 |
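
As a quick sanity check on these pre-training figures, the sketch below confirms that 315k update steps at an effective batch size of 1.5M tokens amounts to 472.5B training tokens; the per-step sequence count it derives from the 1024-token context window is an inference for illustration, not a figure reported here.

```python
# Sanity check of the pre-training arithmetic stated above.
tokens_per_step = 1.5e6   # effective batch size, in tokens
update_steps = 315_000    # number of update steps
context_window = 1024     # tokens per sequence

total_tokens = tokens_per_step * update_steps
sequences_per_step = tokens_per_step / context_window  # inferred, not stated in the card

print(f"total training tokens: {total_tokens / 1e9:.1f}B")  # 472.5B
print(f"sequences per step:    {sequences_per_step:.0f}")    # ~1465
```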

#### Hardware
Model pre-training was carried out using Google TPU v4 accelerators, specifically a TPU v4-1024 containing 512 devices. We trained for a total of approximately four days.

### BibTeX entry and citation info
```bibtex
@article{mendoza2023foundational,
  title={A Foundational Large Language Model for Edible Plant Genomes},
  author={Mendoza-Revilla, Javier and Trop, Evan and Gonzalez, Liam and Roller, Masa and Dalla-Torre, Hugo and de Almeida, Bernardo P and Richard, Guillaume and Caton, Jonathan and Lopez Carranza, Nicolas and Skwark, Marcin and others},
  journal={bioRxiv},
  pages={2023--10},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```