Length Generalization of Causal Transformers without Position Encoding
Paper • 2404.12224 • Published
Note: By scaling the attention scores by a factor of 1.2, NoPE can immediately generalize to over 4K tokens (Figure 1). Figure 2: NoPE can generalize to longer contexts by merely scaling the softmax scores; however, this exact technique does not directly apply to RoPE models.
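A minimal sketch of the scaling idea, assuming a PyTorch-style causal attention without position encoding: the pre-softmax logits are multiplied by a scalar (e.g. 1.2) on top of the usual 1/sqrt(d) factor. The function name `scaled_nope_attention` and the parameter `lam` are illustrative, not from the paper.

```python
import math
import torch
import torch.nn.functional as F

def scaled_nope_attention(q, k, v, lam=1.2):
    """Causal attention without position encoding, with logits scaled by `lam`.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    d = q.size(-1)
    # Standard dot-product logits, with the extra scaling factor lam applied.
    logits = lam * (q @ k.transpose(-2, -1)) / math.sqrt(d)
    # Causal mask: each token attends only to itself and earlier tokens.
    seq_len = q.size(-2)
    mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device),
        diagonal=1,
    )
    logits = logits.masked_fill(mask, float("-inf"))
    weights = F.softmax(logits, dim=-1)
    return weights @ v
```

Since NoPE carries no position signal in the inputs, the only knob changed here at inference time is the logit scale; RoPE models would additionally rotate `q` and `k`, which is why the same trick does not transfer directly.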
Note: Decoders: Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks. More importantly, NoPE outperforms other explicit positional encoding methods while requiring no additional computation... NoPE can represent both absolute and relative PEs, but when trained with SGD, it mostly resembles T5's Relative PE attention patterns.