4 experts per token?
Is there a reason for the change from 2 to 4 experts per token? I thought 2 was supposed to be the "sweet spot". Also, if all experts are being used for every token, is it different than an ordinary merge?
Setting the experts to four just means you now have four "eyes" on every token. In fact, it has been shown to help improve output quality on Mixtral 8x7B, although there is a drop-off after a certain number of extra experts, since each additional expert also increases processing time slightly.
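For anyone who wants to try it themselves, here is a minimal sketch of how the active-experts count can be changed, assuming a Mixtral-style model loaded through Hugging Face `transformers` (the repo id below is just a placeholder, not necessarily this model):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Placeholder repo id; substitute the model you actually want to load.
model_id = "mistralai/Mixtral-8x7B-v0.1"

# Mixtral-style configs expose the number of experts routed per token.
config = AutoConfig.from_pretrained(model_id)
config.num_experts_per_tok = 4  # route each token through 4 experts instead of the default 2

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```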
Also, it is still very different from a regular merge. A merged single model is a jack of all trades and master of none, since it just blends all of the knowledge together, while in an MoE each expert has its own specialty or domain it excels at, hence "Mixture of Experts".
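To make the contrast concrete, here is a toy top-k routing layer (my own illustration, not the actual Mixtral code): only the experts the router picks actually run for each token, whereas a weight merge blends everything into a single network ahead of time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy sparse MoE layer: a router picks the top-k experts per token."""
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Score every expert, keep only the top-k per token.
        logits = self.router(x)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Run each selected expert only on the tokens routed to it.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE(dim=16, top_k=4)
tokens = torch.randn(5, 16)
print(moe(tokens).shape)  # torch.Size([5, 16])
```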
Here is an experiment performed by a member of the LocalLLaMA subreddit:
Even if it's not as good, could you still make an alt version with 2 experts per token? For me, this version with 4 is just too slow :\
Sure! I'll see if I can get one posted soon.
I would be interested in seeing benchmarks of the two versions, just to verify objectively how much impact the extra experts make.
I actually thought that as well; it has been submitted to the leaderboard for a benchmark.