etrop committed
Commit 8835e94
1 Parent(s): 0db4b4b

Update README.md

Files changed (1): README.md (+22, -6)

---
license: cc-by-nc-sa-4.0
datasets:
- InstaDeepAI/plant-genomic-benchmark
---

## Model Overview
AgroNT is a DNA language model trained primarily on edible plant genomes. More specifically, AgroNT uses the transformer architecture with self-attention and a masked language modeling objective to leverage widely available genotype data from 48 different plant species to learn general representations of nucleotide sequences. AgroNT contains 1 billion parameters and has a context window of 1024 tokens.
AgroNT uses a non-overlapping 6-mer tokenizer to convert genomic nucleotide sequences into tokens. As a result, 1024 tokens correspond to approximately 6144 base pairs.
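
For orientation, a minimal usage sketch with the Hugging Face `transformers` library is shown below. The repository id `InstaDeepAI/agro-nucleotide-transformer-1b` and the use of `AutoModelForMaskedLM` are assumptions made for illustration; they are not specified in this README.

```python
# Minimal sketch, not an official example. Assumed: the hub repository id below
# and compatibility with AutoTokenizer / AutoModelForMaskedLM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "InstaDeepAI/agro-nucleotide-transformer-1b"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Non-overlapping 6-mer tokenization: a full 1024-token context window
# spans roughly 1024 * 6 = 6144 base pairs.
sequence = "ATGCCGTACGGTTACA" * 6  # 96 bp toy sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_tokens, vocab_size)
print(inputs["input_ids"].shape, logits.shape)
```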

Our pre-training dataset was built from (mostly) edible plant reference genomes. The dataset consists of approximately 10.5 million genomic sequences across 48 different species.

  #### Processing
All reference genomes for each species were assembled into a single FASTA file. In this FASTA file, all nucleotides other than A, T, C, and G were replaced by N. A tokenizer was used to convert strings of letters into sequences of tokens.
The tokenizer's alphabet consisted of the 4<sup>6</sup> = 4096 possible 6-mer combinations obtained by combining A, T, C, and G, as well as five additional tokens representing the standalone nucleotides A, T, C, G, and N. It also included three special tokens: the pad [PAD], mask [MASK], and class [CLS] tokens. This resulted in a vocabulary of 4104 tokens. To tokenize an input sequence, the tokenizer started with a class token and then converted the sequence from left to right, matching 6-mer tokens when possible and falling back to the standalone tokens when necessary (for instance, when the letter N was present or when the sequence length was not a multiple of 6).
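
The toy function below is one plausible reading of this scheme (greedy left-to-right matching with a standalone fallback); it is illustrative only, and `kmer_tokenize` is a hypothetical helper, not the released tokenizer.

```python
# Illustrative re-implementation of the tokenization scheme described above
# (hypothetical helper; use the released tokenizer in practice).
def kmer_tokenize(sequence: str, k: int = 6) -> list:
    tokens = ["[CLS]"]  # every tokenized sequence starts with the class token
    i = 0
    while i < len(sequence):
        chunk = sequence[i:i + k]
        # Emit a 6-mer token only when a full k-mer of A/T/C/G is available;
        # otherwise fall back to a standalone token (A, T, C, G or N).
        if len(chunk) == k and set(chunk) <= set("ATCG"):
            tokens.append(chunk)
            i += k
        else:
            tokens.append(sequence[i])
            i += 1
    return tokens

# Vocabulary size: 4096 6-mers + 5 standalone tokens + [PAD], [MASK], [CLS] = 4104.
print(kmer_tokenize("ATGCGTNACGTGCAA"))
# ['[CLS]', 'ATGCGT', 'N', 'ACGTGC', 'A', 'A']
```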

The tokenized sequence is passed through the model and a cross entropy loss is calculated. Training used an effective batch size of 1.5M tokens for 315k update steps, resulting in the model being trained on a total of 472.5B tokens.
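
As a quick sanity check of these figures, using no numbers beyond those quoted above:

```python
# Consistency check of the quoted training scale.
tokens_per_step = 1.5e6   # effective batch size in tokens
update_steps = 315_000
total_tokens = tokens_per_step * update_steps
print(f"{total_tokens / 1e9:.1f}B tokens")  # 472.5B tokens
```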
 
  #### Hardware
Model pre-training was carried out using Google TPU v4 accelerators, specifically a TPU v4-1024 containing 512 devices. We trained for a total of approximately four days.

### BibTeX entry and citation info
```bibtex
@article{mendoza2023foundational,
  title={A Foundational Large Language Model for Edible Plant Genomes},
  author={Mendoza-Revilla, Javier and Trop, Evan and Gonzalez, Liam and Roller, Masa and Dalla-Torre, Hugo and de Almeida, Bernardo P and Richard, Guillaume and Caton, Jonathan and Lopez Carranza, Nicolas and Skwark, Marcin and others},
  journal={bioRxiv},
  pages={2023--10},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```