FlexiBERT-Mini model
Pretrained model on the English language using a masked language modeling (MLM) objective. It was found by running a neural architecture search (NAS) over a design space of ~3.32 billion flexible and heterogeneous transformer architectures, as described in the FlexiBERT paper (arXiv:2205.11656). The model is case-sensitive.
Model description
The model uses diverse attention heads, including traditional self-attention and the discrete cosine transform (DCT). The design space also supports weighted multiplicative attention (WMA), discrete Fourier transform (DFT), and convolution operations within the same transformer model, along with a different hidden dimension for each encoder layer.
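To make the DCT operation mentioned above more concrete, here is a minimal pure-Python sketch of DCT-based token mixing. It is an illustration only, not the FlexiBERT implementation: it assumes the transform is an unnormalized DCT-II applied along the sequence dimension, independently for each hidden feature. The function names `dct2` and `dct_mix_tokens` are hypothetical.

```python
import math


def dct2(values):
    """Unnormalized DCT-II of a 1-D sequence of floats."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_mix_tokens(hidden_states):
    """Apply the DCT along the sequence axis, one hidden feature at a time.

    hidden_states is a list of tokens, each a list of hidden features
    (shape: seq_len x hidden_size). The output has the same shape.
    Assumption: mixing along the sequence axis; the actual model may differ.
    """
    seq_len = len(hidden_states)
    hidden_size = len(hidden_states[0])
    # Transform each feature column across all token positions.
    columns = [
        dct2([hidden_states[t][j] for t in range(seq_len)])
        for j in range(hidden_size)
    ]
    # Transpose back to (seq_len, hidden_size).
    return [[columns[j][t] for j in range(hidden_size)] for t in range(seq_len)]


# Toy input: 4 tokens with 2 hidden features each.
hidden_states = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
mixed = dct_mix_tokens(hidden_states)
```

Unlike self-attention, this kind of mixing has no learned parameters: every output position is a fixed cosine-weighted sum over all input positions.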
How to use
This model should be fine-tuned on a downstream task. Other models within the FlexiBERT design space can be generated using a model dictionary. See the FlexiBERT GitHub repository for more details. To instantiate a fresh FlexiBERT-Mini model (for pre-training using the MLM objective):
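from transformers import FlexiBERTConfig, FlexiBERTModel, FlexiBERTForMaskedLM

# Start from a default configuration
config = FlexiBERTConfig()

# Model dictionary describing the FlexiBERT-Mini architecture (one entry per encoder layer)
model_dict = {'l': 4, 'o': ['sa', 'sa', 'l', 'l'], 'h': [256, 256, 128, 128], 'n': [2, 2, 4, 4],
              'f': [[512, 512, 512], [512, 512, 512], [1024], [1024]], 'p': ['sdp', 'sdp', 'dct', 'dct']}

# Load the architecture from the model dictionary into the configuration
config.from_model_dict(model_dict)

# Instantiate a fresh (not yet pre-trained) model with an MLM head
model = FlexiBERTForMaskedLM(config)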
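When writing a model dictionary for another architecture in the design space, a quick structural check can catch mistakes before instantiating a model. The sketch below is not part of FlexiBERT; `check_model_dict` is a hypothetical helper. It assumes, based only on the example above, that `'l'` is the number of encoder layers and that each of the other keys holds one entry per layer.

```python
def check_model_dict(model_dict):
    """Return a list of structural problems found in a model dictionary.

    Assumption (inferred from the FlexiBERT-Mini example, not from the
    library): 'l' is the layer count and 'o', 'h', 'n', 'f', 'p' are
    per-layer lists with exactly 'l' entries each.
    """
    num_layers = model_dict['l']
    problems = []
    for key in ('o', 'h', 'n', 'f', 'p'):
        if key not in model_dict:
            problems.append(f"missing key '{key}'")
        elif len(model_dict[key]) != num_layers:
            problems.append(
                f"'{key}' has {len(model_dict[key])} entries, expected {num_layers}"
            )
    return problems


# The FlexiBERT-Mini dictionary from the example above.
model_dict = {'l': 4, 'o': ['sa', 'sa', 'l', 'l'], 'h': [256, 256, 128, 128], 'n': [2, 2, 4, 4],
              'f': [[512, 512, 512], [512, 512, 512], [1024], [1024]], 'p': ['sdp', 'sdp', 'dct', 'dct']}

problems = check_model_dict(model_dict)

# A deliberately broken variant: only two hidden sizes for four layers.
broken_problems = check_model_dict(dict(model_dict, h=[256, 256]))
```

An empty list means the dictionary is structurally consistent under these assumptions. It does not guarantee that the values themselves are valid choices within the design space.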
Developer
Shikhar Tuli. For any questions, comments or suggestions, please reach me at [email protected].
Cite this work
Cite our work using the following BibTeX entry:
@article{tuli2022jair,
title={{FlexiBERT}: Are Current Transformer Architectures too Homogeneous and Rigid?},
author={Tuli, Shikhar and Dedhia, Bhishma and Tuli, Shreshth and Jha, Niraj K.},
year={2022},
eprint={2205.11656},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
License
BSD-3-Clause. Copyright (c) 2022, Shikhar Tuli and Jha Lab. All rights reserved.
See License file for more details.