Pooling method: mean vs last?
Same as the title: which one should I choose for inference or training?
We recommend using the last-token pooling method; please refer to the example code in the model introduction.
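For readers who don't want to dig through the model card right away, here is a minimal sketch of last-token pooling in PyTorch. The function name and the padding handling are illustrative and may differ in detail from the model introduction's example code:

```python
import torch
from torch import Tensor

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    """Take the hidden state of the last non-padding token as the embedding."""
    # If the batch is left-padded, the final position of every row is a real
    # token, so the last hidden state can be used directly.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # Otherwise (right padding), index the last non-padding position per row.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device),
        sequence_lengths,
    ]
```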
I noted that in the original GTE paper, "Towards General Text Embeddings with Multi-stage Contrastive Learning" (Section 3.1, Model Architecture), mean pooling is used. However, in gte-Qwen2-7B-instruct, last-token pooling is used, as shown in the example code and the config file. Is there any literature reference or practical experience that could be shared on this design choice? It looks like bidirectional embedding models typically use mean pooling (as in the original GTE paper with BERT), while last-token pooling is more common for decoder-only LLM-based embedding models.
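For comparison with the last-token sketch above, this is the mask-aware mean pooling typically used with bidirectional encoders such as BERT-based GTE. It is an illustrative sketch, not code taken from the GTE release:

```python
import torch
from torch import Tensor

def mean_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    """Average the hidden states over real (non-padding) tokens only."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_states.dtype)
    summed = (last_hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts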