Comparison with Tamil Llama and stance on an encoder-only model
@Hemanth-thunder - Thank you for sharing this model. Happy to see the development of Mistral-series models for Tamil. Congratulations and kudos for the inspiring effort. IMHO, I like the Mistral series better because of the Apache 2.0 license, which is more permissive than the Llama license.
I have a few questions out of curiosity (based on a comparison with the Tamil Llama paper on arXiv):
What is the difference between this and the Tamil Llama 7B model?
Is it more tokens, more data, and the sliding-window attention from Mistral?
Since Mistral is a decoder-only model, I would also love to hear your thoughts on developing an encoder-only model for Tamil.
My interest is in developing a Tamil RAG application.
We need a Tamil-specific retriever and a generator LLM, so I would love to hear your thoughts on fine-tuning an embedding model to improve retrieval quality for Tamil.
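For the fine-tuning part, I was picturing the usual in-batch contrastive (InfoNCE) objective over (query, passage) pairs. This is only a toy NumPy sketch of the loss, not a training recipe; a real run would use PyTorch and a pretrained multilingual encoder (my assumption, not something established in this thread):

```python
import numpy as np

def info_nce_loss(query_embs: np.ndarray, pos_embs: np.ndarray,
                  temperature: float = 0.05) -> float:
    """In-batch contrastive (InfoNCE) loss often used to fine-tune
    embedding models: row i of pos_embs is the positive passage for
    query i, and every other row in the batch acts as a negative."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pos_embs / np.linalg.norm(pos_embs, axis=1, keepdims=True)
    sims = (q @ p.T) / temperature            # (batch, batch) similarities
    sims -= sims.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    # the correct (query, passage) pairing sits on the diagonal
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: orthonormal vectors standing in for real embeddings
aligned = np.eye(3)
low = info_nce_loss(aligned, aligned)         # matching pairs -> near-zero loss
high = info_nce_loss(aligned, aligned[::-1])  # mismatched pairs -> large loss
```

Minimizing this pulls each Tamil query toward its matching passage while pushing it away from the rest of the batch, which is exactly the property a retriever needs.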
Thank you for your kind words and appreciation! I'm glad you're excited about the Mistral series models for Tamil. Indeed, the Mistral models represent a significant development, especially with the adoption of the Apache 2.0 license.
Yes, there are a few differences. I used a larger dataset than the Tamil Llama 7B model. Also, Mistral's sliding-window attention speeds things up and lets the model handle longer inputs, since each query token attends only to the keys and values of tokens within a fixed-size window rather than the full context.
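For anyone following along, the sliding-window constraint can be sketched as an attention mask where each token sees only the most recent positions (illustrative window size here; Mistral 7B uses a 4096-token window):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: query position i may attend to key
    position j only when i - window < j <= i."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
# Each row is True for at most the 3 most recent positions, so the
# attention cost per token is O(window) instead of O(seq_len).
```

With this mask, the per-token K/V that must be kept around is bounded by the window size, which is what gives the speed and memory benefit over full causal attention.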
Great to hear that you're also working on a Tamil RAG application! Developing an encoder-only model for Tamil embeddings sounds like a crucial step toward building an effective retrieval system. With such an encoder, we can compute the similarity between the query and the document chunks and retrieve the relevant passages efficiently.

Moreover, fine-tuning the Tamil LLMs on both the Llama and Mistral architectures would be essential. That would optimize the language models specifically for Tamil, ensuring better performance and accuracy in generating responses.
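As a rough illustration of that retrieval step, here is cosine-similarity ranking of chunks against a query, with toy vectors standing in for the output of a (hypothetical) fine-tuned Tamil encoder:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, chunk_vecs: np.ndarray, top_k: int = 2):
    """Rank document chunks by cosine similarity to the query embedding
    and return the indices and scores of the top_k matches."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                   # cosine similarity per chunk
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]

# Toy 4-dim embeddings for three chunks and one query
chunks = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.7, 0.7, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
query = np.array([1.0, 0.1, 0.0, 0.0])
idx, scores = retrieve(query, chunks)
```

In a full pipeline, the top-ranked chunks would then be placed into the generator LLM's prompt as context.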