Questions on Training and Architecture
I’m exploring this model, particularly its training methods and architectural specifics, and I have a few questions:
How exactly was KaLM trained on top of Qwen?
What loss function or objective was used to train KaLM? Was a specific ranking or contrastive loss applied?
What metric was chosen to optimize embeddings, and how was it used in training?
Was a particular positional-encoding method used, given the multilingual scope and the Qwen base model?
Thank you in advance for any insights or resources on KaLM’s architecture and training processes.
Thank you for your interest in our model. We have trained it using the Qwen2 model without any architectural modifications. For detailed information on the architecture, please refer to the Qwen model documentation.
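To make the "no architectural modifications" point concrete, here is a minimal sketch of producing text embeddings from an unmodified Qwen2 backbone with Hugging Face transformers. The checkpoint name and the last-token pooling strategy are illustrative assumptions on my part, not details confirmed in this thread:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; substitute the released KaLM weights.
model_name = "Qwen/Qwen2-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

texts = ["What is KaLM?", "KaLM is an embedding model built on Qwen2."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Last-token pooling (an assumption: common for decoder-only embedding models).
last_idx = batch["attention_mask"].sum(dim=1) - 1  # index of last real token
embeddings = hidden[torch.arange(hidden.size(0)), last_idx]  # (batch, dim)
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
```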
Regarding the loss function, we employ the widely used InfoNCE loss. The training code is currently available in FlagEmbedding.
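For concreteness, below is a minimal PyTorch sketch of the standard InfoNCE objective over (query, positive, negatives) triples. The temperature value and explicit hard-negative handling are assumptions; the actual FlagEmbedding implementation may differ in details such as in-batch negatives:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, neg_emb, temperature=0.05):
    """InfoNCE: pull each query toward its positive passage and push it
    away from negatives via a softmax over cosine similarities."""
    # Normalize so dot products become cosine similarities.
    q = F.normalize(query_emb, dim=-1)  # (B, D)
    p = F.normalize(pos_emb, dim=-1)    # (B, D)
    n = F.normalize(neg_emb, dim=-1)    # (B, K, D)

    pos_sim = torch.sum(q * p, dim=-1, keepdim=True)  # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", q, n)        # (B, K)

    # Positive sits at index 0, so the target label is 0 for every row.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```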
We will be releasing more details about the training process and data soon.