MegaEfficentModel-PyTorch
⚠️🪖 This model has not been tested on Nakamoto. Usage is at your own risk AND NOT RECOMMENDED. Please see the disclaimer below. ⚠️🪖
current version : 0.1
sequence length : 512
layers : 24
attention heads : 24
dimension : 2048
learning rate : 2e-4
trained steps : 383000
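For reference, the hyperparameters above can be collected into a small config object. The sketch below is illustrative only; the MEMConfig class and its field names are assumptions for this example and are not the actual MEM-PyTorch API.

```python
from dataclasses import dataclass

@dataclass
class MEMConfig:
    # Hypothetical container for the hyperparameters listed above;
    # the class name and fields are illustrative, not the MEM-PyTorch API.
    seq_len: int = 512
    n_layers: int = 24
    n_heads: int = 24
    dim: int = 2048
    learning_rate: float = 2e-4
    trained_steps: int = 383_000

print(MEMConfig())
```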
GPT architectures have proven quite useful in many areas of research and industry, yet their usage is confined to high-end NVIDIA GPUs. This prevents many researchers and enthusiasts from performing rapid experimentation and development on large language models.
Our mission at Opentensor Cortex is to develop the critical infrastructure and tooling to enable researchers, enthusiasts, and server operators to run large language models on consumer graphics cards.
MegaEfficentModel (MEM for short) incorporates FlashAttention, a new attention implementation from HazyResearch. FlashAttention uses a fused kernel to reduce the time needed for the attention operation while supporting sequence lengths beyond 2048 in a memory-efficient manner.
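As a rough illustration of the idea (not the MEM codebase itself), PyTorch 2.0+ exposes a fused attention operator, torch.nn.functional.scaled_dot_product_attention, which on supported GPUs can dispatch to a FlashAttention-style kernel. The shapes below are arbitrary examples, and whether MEM uses this operator or the HazyResearch package directly is not covered here.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only (not tied to the MEM hyperparameters above).
batch, heads, seq_len, head_dim = 1, 24, 512, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# On supported GPUs this call can dispatch to a fused, FlashAttention-style
# kernel that never materializes the full (seq_len x seq_len) score matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 24, 512, 64])
```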
In anecdotal comparisons, MEM-Pytorch used roughly 3GB of VRAM where GPT-Neo-1.3b used roughly 8GB. Autocast may offer additional memory savings.
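A minimal sketch of autocast, assuming a CUDA device is available; the model and batch here are placeholders rather than MEM-PyTorch objects, shown only to illustrate how mixed precision is typically enabled in PyTorch.

```python
import torch

# Placeholder module and batch; these are not MEM-PyTorch APIs.
model = torch.nn.Linear(2048, 2048).cuda()
batch = torch.randn(8, 2048, device="cuda")

# Ops that benefit from lower precision (e.g. matmuls) run in float16
# inside this context, which typically reduces activation memory.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(batch)

print(out.dtype)  # torch.float16
```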
DISCLAIMER: This model still has many bugs to squash and optimizations to perform, and it lacks proper benchmarking. These fixes and improvements will arrive over the coming months but are not yet reflected in the codebase. Please reach out to the @Cortex team on Discord with any questions.
Credits
Opentensor Foundation : provided the compute to train these models.
Lucidrains : MEM is inspired by their work on FlashAttention.