Missing untokenized "loom" datasets
Hi, I noticed that the untokenized single-cell dataset used in tokenizer.py is missing. Do you have plans to release it? Data collection is also a major contribution of Geneformer, and releasing the datasets would make it much easier for the entire community to reuse them. Thank you!
Thank you for your question and interest in Geneformer! The primary purpose of this repository is to enable the community to use the pretrained Geneformer model to answer their own scientific questions. Large-scale pretraining of a foundation model like Geneformer requires expertise and computational resources that are not universally available, so by providing the pretrained model we can share the fundamental knowledge Geneformer gained during pretraining with the broader scientific community.
We additionally provide the code for pretraining, along with Genecorpus-30M, in the dataset repository to help researchers who would like to pretrain their own models instead of using Geneformer. One of the contributions of this work is a new method for encoding gene expression data as rank value encodings, so we provide the tools for researchers to do that as well. All of the raw count data that we tokenized for Genecorpus-30M is publicly available; we did not generate it ourselves. The datasets are available from the original authors and databases as cited in the manuscript Methods. Additionally, Genecorpus-30M was established over 2 years ago, so much more data has become available since then, and there are now even more databases for collecting large amounts of data in a straightforward way. We would encourage users to check out CellxGene (https://cellxgene.cziscience.com/), for example, which is a new endeavor that contains large amounts of single-cell RNA-seq data along with structured metadata.
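For readers who want a feel for the rank value encoding mentioned in the reply above, here is a minimal sketch. It is not the repository's tokenizer. It assumes the scheme as commonly described: normalize each cell's counts by its total counts, scale each gene by a corpus-wide nonzero median, then order the detected genes from highest to lowest scaled value. The function name `rank_value_encode`, the gene names, and the median values are all hypothetical and chosen only for illustration.

```python
import numpy as np


def rank_value_encode(counts, gene_ids, gene_medians):
    """Return gene IDs ordered by scaled expression, highest first.

    Illustrative only: `gene_medians` stands in for per-gene nonzero
    medians computed across a corpus (assumed to be positive).
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return []  # nothing detected in this cell

    # Normalize for sequencing depth; the 10,000 scale factor is an
    # assumption and does not change the resulting ranks.
    normalized = counts / total * 10_000

    # Scale each gene by its corpus-wide median so that ubiquitously
    # high genes do not dominate the top of the ranking.
    scaled = normalized / np.asarray(gene_medians, dtype=float)

    # Keep only detected genes and sort them in descending order.
    detected = np.nonzero(scaled)[0]
    order = detected[np.argsort(-scaled[detected], kind="stable")]
    return [gene_ids[i] for i in order]


# Hypothetical toy cell with four genes.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
medians = [2.0, 0.5, 1.0, 4.0]
cell = [10, 5, 0, 4]

encoding = rank_value_encode(cell, genes, medians)
print(encoding)
```

In this toy cell, GENE_B has a lower raw count than GENE_A but ranks first because its assumed corpus median is much smaller, and GENE_C is dropped because it was not detected.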