Questions about the tokenization cycle
Hi, I wonder how to determine the tokenization cycle in a real application, since the Hugging Face tokenization functions do not seem to provide such a cycle design. Can I directly use your tokenization file, and what is the maximum length of your model input? Thanks a lot.
Hi,
The code for the next-k-mer prediction, which we used to determine the cycle numbers, is linked in this paper: https://link.springer.com/article/10.1186/s12859-024-05869-5
The code showing how we applied it to determine the cycle numbers is linked in this paper: https://www.nature.com/articles/s42256-024-00872-0
For the hg19 genome you can directly use the tokenisation we provide. For any other human genome build you can reapply the tokenisation rules. For any other genome I would recommend redetermining the tokenisation rules.
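To illustrate what "reapplying the tokenisation rules" could look like, here is a minimal byte-pair-encoding sketch in plain Python. It is a hypothetical illustration only: the function names, the toy sequences, and the fixed cycle count are assumptions, and the real cycle number comes from the next-k-mer prediction procedure described in the linked papers.

```python
# Minimal BPE-style sketch for DNA tokenisation (hypothetical illustration).
# learn_merges() derives merge rules from one sequence for a fixed number of
# cycles; tokenise() reapplies those rules, in order, to another sequence.
from collections import Counter

def merge_once(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_merges(seq, n_cycles):
    """Learn `n_cycles` merge rules (one per cycle) from a nucleotide string."""
    tokens = list(seq)
    merges = []
    for _ in range(n_cycles):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        tokens = merge_once(tokens, best)
    return merges

def tokenise(seq, merges):
    """Reapply previously learned merge rules to a new sequence."""
    tokens = list(seq)
    for pair in merges:
        tokens = merge_once(tokens, pair)
    return tokens
```

With this sketch, rules learned on one genome build can be reapplied to another by calling `tokenise()` with the saved merge list, which mirrors the advice above for human genome builds other than hg19.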
I am not sure what exactly you mean by "model input". If you mean the DNA window used as input to the transformer during pretraining, it is 510 tokens plus two special tokens. Our average token length is about 4 nt, so depending on sequence complexity a window covers roughly 2 kb on average. I am not sure whether this answers your question.
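For clarity, the arithmetic behind the roughly 2 kb figure can be spelled out directly (the constant names below are just for illustration; the numbers are the ones stated above):

```python
# Back-of-the-envelope size of one pretraining window.
TOKENS_PER_WINDOW = 510   # plus 2 special tokens -> 512 positions of model input
AVG_TOKEN_LEN_NT = 4      # approximate average token length in nucleotides

window_nt = TOKENS_PER_WINDOW * AVG_TOKEN_LEN_NT
print(window_nt)          # 2040 nt, i.e. roughly 2 kb per window
```

Because the 4 nt figure is only an average, the actual genomic span of a window varies with local sequence complexity.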
Please also notify me by email ([email protected]) if you make further comments; we don't really watch this space.