Questions about the tokenization cycle
Hi, I wonder how to determine the tokenization cycle in a real application, since the Hugging Face tokenization functions do not seem to provide such a cycle design. Can I directly use your tokenization file, and what is the maximum length of your model input? Thanks a lot.
Hi,
The code for the next-k-mer prediction, which we used to determine the cycle numbers, is linked in this paper: https://link.springer.com/article/10.1186/s12859-024-05869-5
The code showing how we applied it to determine the cycle numbers is linked in this paper: https://www.nature.com/articles/s42256-024-00872-0
For the hg19 genome you can directly use the tokenisation we provide. For any other human genome build you can reapply the tokenisation rules. For any other genome I would recommend redetermining the tokenisation rules.
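To illustrate what "reapplying the tokenisation rules" could look like, here is a minimal byte-pair-encoding sketch in plain Python. It is a hypothetical illustration only: the function names, the toy sequences, and the fixed cycle count are assumptions, and the real cycle number comes from the next-k-mer prediction procedure described in the linked papers.

```python
# Minimal BPE-style sketch for DNA tokenisation (hypothetical illustration).
# learn_merges() derives merge rules from one sequence for a fixed number of
# cycles; tokenise() reapplies those rules, in order, to another sequence.
from collections import Counter

def merge_once(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_merges(seq, n_cycles):
    """Learn `n_cycles` merge rules (one per cycle) from a nucleotide string."""
    tokens = list(seq)
    merges = []
    for _ in range(n_cycles):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        tokens = merge_once(tokens, best)
    return merges

def tokenise(seq, merges):
    """Reapply previously learned merge rules to a new sequence."""
    tokens = list(seq)
    for pair in merges:
        tokens = merge_once(tokens, pair)
    return tokens
```

With this sketch, rules learned on one genome build can be reapplied to another by calling `tokenise()` with the saved merge list, which mirrors the advice above for human genome builds other than hg19.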
I am not sure what exactly you mean by "model input". If you mean the DNA window used as input to the transformer during pretraining, it is 510 tokens plus two special tokens. Our average token length is about 4 nt, so depending on sequence complexity a window covers roughly 2 kb on average. I am not sure whether this answers your question.
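For clarity, the arithmetic behind the roughly 2 kb figure can be spelled out directly (the constant names below are just for illustration; the numbers are the ones stated above):

```python
# Back-of-the-envelope size of one pretraining window.
TOKENS_PER_WINDOW = 510   # plus 2 special tokens -> 512 positions of model input
AVG_TOKEN_LEN_NT = 4      # approximate average token length in nucleotides

window_nt = TOKENS_PER_WINDOW * AVG_TOKEN_LEN_NT
print(window_nt)          # 2040 nt, i.e. roughly 2 kb per window
```

Because the 4 nt figure is only an average, the actual genomic span of a window varies with local sequence complexity.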
Please also notify me by email ([email protected]) if you make further comments; we don't really watch this space.