Dynamic 8x7B Mixtral Model
Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw: 17 MoE FF layers, 15 dense FF layers
Model Details
Model Description
This is a MoE layer-pruning experiment based on Nous-Hermes-2-Mixtral-8x7B-DPO, so it uses the same ChatML format for conversations.
In 15 of the 32 decoder layers, the MoE block is merged into a normal feed-forward layer (17/32 layers remain MoE), reducing the total parameter count from 47B to 14B.
The pruned layer indices are listed below; a sketch of one possible merge procedure follows the list.
[3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]
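The card does not spell out how a MoE block is collapsed into a dense block, so the following is only an illustrative sketch, not the procedure used for this checkpoint. It assumes the attribute names of the stock transformers Mixtral implementation (block_sparse_moe, experts, and the w1/w2/w3 linears per expert) and uses a simple uniform average of the expert weights; the custom model class would also need to call a plain MLP (no router) at these layer indices.

import torch

def merge_moe_block_to_dense(moe_block):
    # moe_block: a MixtralSparseMoeBlock (stock transformers naming); its
    # .experts list holds per-expert MLPs with w1/w2/w3 linear layers.
    experts = moe_block.experts
    dense = experts[0]  # reuse the first expert as the dense FF container
    with torch.no_grad():
        for name in ("w1", "w2", "w3"):
            stacked = torch.stack([getattr(e, name).weight for e in experts])
            getattr(dense, name).weight.copy_(stacked.mean(dim=0))
    return dense

# Hypothetical usage on a full-precision Mixtral checkpoint; the real merge
# and attribute names in this repo's custom model class may differ.
pruned = [3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]
# for idx in pruned:
#     layer = model.model.layers[idx]
#     layer.block_sparse_moe = merge_moe_block_to_dense(layer.block_sparse_moe)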
- Developed by: MistralAI, NousResearch, theblackcat
- Model type: Modified Mixtral Architecture for dynamic MoE
- License: apache-2.0
Model Sources
- Repository: [More Information Needed]
- Paper: [More Information Needed]
- Demo: [More Information Needed]
Uses
This model is still at the experimental stage; we are still searching for the sweet spot that runs in just under 24 GB of memory with a 4-bit quantization config. A minimal inference example follows; an alternative, more explicit 4-bit configuration is shown after it.
import torch
from transformers import AutoTokenizer

model_path = "path/to/this/model"  # local checkout or Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_path)

# CustomMixtralForCausalLM is the modified Mixtral class provided by this
# repository's custom modeling code (hence trust_remote_code=True).
model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    trust_remote_code=True,
)

# Sanity check: report the total parameter count in billions.
pytorch_total_params = sum(p.numel() for p in model.parameters())
print(pytorch_total_params / 1e9)

# Prompt in ChatML format (same format as Nous-Hermes-2-Mixtral-8x7B-DPO).
max_length = 100
input_text = """<|im_start|>user\nHow are you? Write a story for me please<|im_end|><|im_start|>assistant\n"""
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to("cuda")
print(len(input_ids[0]))

output = model.generate(input_ids, max_length=max_length, temperature=0.7, repetition_penalty=1.1, do_sample=True)
print(tokenizer.decode(output[0]))
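Note that passing load_in_4bit=True directly to from_pretrained is deprecated in recent transformers releases. Assuming a recent transformers and bitsandbytes install, an equivalent and more tunable way to hit the ~24 GB target is an explicit BitsAndBytesConfig; model_path and CustomMixtralForCausalLM are the same names as in the snippet above.

import torch
from transformers import BitsAndBytesConfig

# Explicit 4-bit config: NF4 weights, bfloat16 compute, double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)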