Speed on CPU

#8
by zokica - opened

I have tried LLaMa 7B and this model on a CPU, and LLaMa is much faster (7 seconds vs. 43 for 20 tokens). Is this the right way to run the model on a CPU, or am I missing something?

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
)

import time

timea = time.time()
prompt = "A lion is"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=20, do_sample=True, temperature=0.75, return_dict_in_generate=True
)
token = outputs.sequences[0]
output_str = tokenizer.decode(token)
print(output_str)
print("timea = time.time()", time.time() - timea)

The output:

MPT-7B:

A lion is a large cat. Lions are native to Africa. Lions live in the savanna, a grassland
timea = time.time() 43.37369394302368

LLaMa-7B:

<s> A lion is the king of the jungle. The lion is the strongest animal in the animal kingdom
timea = time.time() 6.919593811035156
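
For reference, with the 20-token budget from the scripts above, those wall-clock times work out to roughly 0.46 tokens/s for MPT-7B versus about 2.9 tokens/s for LLaMa-7B; a quick back-of-the-envelope check using the numbers reported above:

new_tokens = 20
mpt_seconds = 43.37
llama_seconds = 6.92
print(f"MPT-7B:   {new_tokens / mpt_seconds:.2f} tokens/s")    # ~0.46 tokens/s
print(f"LLaMa-7B: {new_tokens / llama_seconds:.2f} tokens/s")  # ~2.89 tokens/s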

You're comparing ggml vs. PyTorch; until this model gets the ggml treatment, expect CPU-only speeds to be slower.

How did you conclude that I used ggml?

Of course I did not use ggml; I used exactly the same BF16 setup for both LLaMa and MPT-7B, and LLaMa is much faster. The only change between the two runs is the model name:
model_name = "huggyllama/llama-7b"
model_name = "mosaicml/mpt-7b"

Here is exactly what I used for LLaMa, so you can replicate it and see for yourself:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
)

import time

timea = time.time()
prompt = "A lion is"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(
    inputs.input_ids, max_new_tokens=20, do_sample=True, temperature=0.75, return_dict_in_generate=True
)
token = outputs.sequences[0]
output_str = tokenizer.decode(token)
print(output_str)
print("timea = time.time()", time.time() - timea)

Hi @zokica, we will take a look at this, as we're seeing a couple of reports of slow CPU inference. Since you have a system on hand that shows the issue, could you help confirm whether any of the MPT vs. LLaMa speed gap changes based on the torch_dtype and low_cpu_mem_usage flags? Basically this matrix:

  • torch_dtype=torch.float32, low_cpu_mem_usage=False: ?
  • torch_dtype=torch.float32, low_cpu_mem_usage=True: ?
  • torch_dtype=torch.bfloat16, low_cpu_mem_usage=False: ?
  • torch_dtype=torch.bfloat16, low_cpu_mem_usage=True: MPT slower than LLaMa

In the meantime we will try to reproduce as well. Thank you for the report!
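
For anyone who wants to fill in that matrix, a minimal timing sketch along these lines should work, RAM permitting; the model names, prompt, and 20-token budget are taken from the scripts above, and the loop simply reloads the model for each configuration (this is an illustrative sketch, not an official benchmark script):

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Time 20-token generation for each dtype / low_cpu_mem_usage combination.
for model_name in ["mosaicml/mpt-7b", "huggyllama/llama-7b"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for dtype in [torch.float32, torch.bfloat16]:
        for low_mem in [False, True]:
            model = AutoModelForCausalLM.from_pretrained(
                model_name, torch_dtype=dtype, low_cpu_mem_usage=low_mem, trust_remote_code=True
            )
            inputs = tokenizer("A lion is", return_tensors='pt')
            start = time.time()
            model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.75)
            print(model_name, dtype, low_mem, f"{time.time() - start:.1f} s")
            del model  # free memory before loading the next configuration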

Hi,

I tested both of zokica's scripts above on a cheap VPS: 18 cores, 48 GB RAM, 2048 GB SSD (RAID 10).

LLaMa is still faster, but with float32 only by a factor of about 2.

  • torch_dtype=torch.float32, low_cpu_mem_usage=False: MPT 95.7 s | LLaMa 43.2 s
  • torch_dtype=torch.float32, low_cpu_mem_usage=True: MPT 98.6 s | LLaMa 48.6 s
  • torch_dtype=torch.bfloat16, low_cpu_mem_usage=False: MPT 1747.8 s | LLaMa 177.7 s
  • torch_dtype=torch.bfloat16, low_cpu_mem_usage=True: MPT 1764.6 s | LLaMa 178.2 s

Thank you so much! This definitely seems like a bottleneck somewhere in the MPT forward pass or KV caching logic. It's very interesting that this shows up on CPU but not on GPU (where we saw the opposite relation, ~1.5-2x faster for MPT with Triton). We will look into it and patch the model source once we find a fix.

Last question, what version of torch were you using for those results?

I actually ran it in BF16, as I have only 32 GB of RAM in this server, so I had to use a low-memory option.

Is there any other way to run it on a CPU with just 32 GB of memory without using bf16?

I am, and probably most people are, just using the CPU for testing; it would be nice if this could work a bit faster, but it is not a big problem.

So it runs faster than LLaMa on a GPU even without Triton, right?
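
Rough memory arithmetic behind the 32 GB constraint, assuming ~6.7B parameters for both models: float32 weights alone take about 25 GiB, leaving little headroom for activations, the KV cache, and the OS, while bfloat16 halves that. A quick estimate:

params = 6.7e9  # approximate parameter count of MPT-7B / LLaMa-7B
for name, bytes_per_param in [("float32", 4), ("bfloat16", 2)]:
    print(f"{name}: ~{params * bytes_per_param / 2**30:.0f} GiB of weights")
# float32: ~25 GiB of weights; bfloat16: ~12 GiB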

You're comparing ggml vs. PyTorch; until this model gets the ggml treatment, expect CPU-only speeds to be slower.

There are ggml versions on Hugging Face 🤗

Thank you so much! This definitely seems like a bottleneck somewhere in the MPT forward pass or KV caching logic. It's very interesting that this shows up on CPU but not on GPU (where we saw the opposite relation, ~1.5-2x faster for MPT with Triton). We will look into it and patch the model source once we find a fix.

Last question, what version of torch were you using for those results?

2.0.1+cpu

For me, it is taking 35 minutes to generate 100 tokens.
Laptop specification: no GPU, 20 GB RAM (4 + 16 GB), 1 TB SSD, i5 processor.
It is a very slow laptop with no GPU.

def customGenerate(argPrompt):
    # Generate a single new token and return the full decoded text so far.
    inputs = tokenizer(argPrompt, return_tensors='pt').to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=1, do_sample=True, temperature=0.75, return_dict_in_generate=True
    )
    token = outputs.sequences[0]
    output_str = tokenizer.decode(token)

    return output_str

import time
from datetime import datetime

timea = time.time()
dtNow = datetime.now()
print("now =", dtNow)
print("Start time:", time.time() - timea)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
)

prompt = ["Earth is"]

count = 0

# Note: each iteration re-tokenizes the entire text generated so far and asks
# for only one new token, so the model re-processes the whole sequence each step.
while count < 100:
    output_str = customGenerate(prompt[-1])
    prompt.append(output_str)
    print(len(prompt), ':', prompt[-1])
    print("Time taken in sec:", time.time() - timea)
    print("Time taken in min:", (time.time() - timea) / 60)
    count = count + 1

dtNow = datetime.now()
print("now =", dtNow)

I am having a hard time running this on CPU; could someone please help me? I get the error:

ImportError: This modeling file requires the following packages that were not found in your environment: einops. Run pip install einops

But then it seems einops needs a CUDA driver to be installed :(

The CPU load time should be fixed now as of this PR, as long as you use device_map=auto: https://huggingface.co/mosaicml/mpt-7b/discussions/47
We also added some logic to improve KV caching speed. Let us know if you see improvements!
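
With that change, loading on CPU would look something like the following sketch; device_map="auto" requires the accelerate package to be installed, and the bfloat16 dtype is simply carried over from the scripts above:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" (via accelerate) picks up the improved loading path.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)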

Closing as complete, but if anyone sees any CPU inference speed issues, please reopen this or open a new issue!

sam-mosaic changed discussion status to closed
