FlexiBERT-Mini model

Pretrained model on the English language using a macked language modeling (MLM) objective. It was found by executing a neural architecture search (NAS) over a design space of ~3.32 billion flexible and heterogeneous transformer architectures in this paper. The model is case sensitive.

Model description

The model consists of diverse attention heads including the traditional self-attention and the discrete cosine transform (DCT). The design space also supports weighted multiplicative attention (WMA), discrete Fourier transform (DFT), and convolution operations in the same transformer model along with different hidden dimensions for each encoder layer.

How to use

This model should be finetuned on a downstream task. Other models within the FlexiBERT design space can be generated using a model dicsiontary. See this github repo for more details. To instantiate a fresh FlexiBERT-Mini model (for pre-trainining using the MLM objective):

from transformers import FlexiBERTConfig, FlexiBERTModel, FlexiBERTForMaskedLM
config = FlexiBERTConfig()
model_dict = {'l': 4, 'o': ['sa', 'sa', 'l', 'l'], 'h': [256, 256, 128, 128], 'n': [2, 2, 4, 4],
      'f': [[512, 512, 512], [512, 512, 512], [1024], [1024]], 'p': ['sdp', 'sdp', 'dct', 'dct']}
config.from_model_dict(model_dict)
model = FlexiBERTForMaskedLM(config)

Developer

Shikhar Tuli. For any questions, comments or suggestions, please reach me at [email protected].

Cite this work

Cite our work using the following bitex entry:

@article{tuli2022jair,
      title={{FlexiBERT}: Are Current Transformer Architectures too Homogeneous and Rigid?}, 
      author={Tuli, Shikhar and Dedhia, Bhishma and Tuli, Shreshth and Jha, Niraj K.},
      year={2022},
      eprint={2205.11656},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

See License file for more details.

shikhartuli
/

flexibert-mini