Confusing ArcticDecoderLayer::forward() implementation
#11
by
sszymczyk
- opened
I'm a bit confused by the `ArcticDecoderLayer::forward()` method's implementation in the model:
- Does the model work correctly with parallel_attn_mlp_res set to false?
- There is a normalization layer called `post_attention_layernorm`. Do I understand correctly that, if `parallel_attn_mlp_res` is set to true, it actually normalizes the layer input rather than the attention output?
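To make the second question concrete, here is a minimal sketch of the control flow I'm asking about. This is not the actual Arctic code: tensors are replaced by scalars and the sublayers (attention, MLP, MoE) by trivial stand-in functions, and the branch structure reflects my reading of the forward pass, which may be wrong:

```python
def norm(x, scale):
    # Stand-in for a LayerNorm module (real code would call an nn.LayerNorm).
    return x * scale

def forward(x, parallel_attn_mlp_res):
    residual_input = x                       # the layer input
    h = norm(x, 0.5)                         # input_layernorm
    h = h + 1.0                              # stand-in for self-attention
    attn_out = residual_input + h            # first residual add

    if parallel_attn_mlp_res:
        # Residual MLP branch operates on the attention output...
        r = norm(attn_out, 0.5) + 2.0        # residual_layernorm + residual_mlp
        residual_residual = attn_out + r
        # ...but post_attention_layernorm is applied to the LAYER INPUT,
        # so the MoE branch runs in parallel with the attention branch.
        m = norm(residual_input, 0.5) + 3.0  # post_attention_layernorm + MoE
        return residual_residual + m
    else:
        # Sequential path: post_attention_layernorm sees the attention output.
        m = norm(attn_out, 0.5) + 3.0        # post_attention_layernorm + MoE
        return attn_out + m
```

If this reading is right, the name `post_attention_layernorm` is misleading in the parallel case, since its input is `residual_input`, not the attention output.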