Brilliant!
Your speed. {{slow clapping}}
Imagine how good it can be with 180B.
I hope I can run it on 4x 3090s.
how good?
So far Falcon 40B is worse than Llama 2 13B... so the 180B might reach the level of Llama 2 34B, or a bit better...
Edit: HA! I was right:
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
Falcon 180B is at the level of Llama 2 34B...
Wow, how did you all manage to complete it so incredibly quickly?!
@penut85420 I only uploaded it a couple of hours ago, so it took me 24 hours. Far too long :)
Had some problems overnight; I forgot the files would be >50GB so they failed to upload until I manually split them this morning.
I hope I can run it on 4x 3090s.
Haha, it's impossible. You need like 300+ GB of RAM for that. :))
Correction about RAM:
You will need at least 400GB of memory to swiftly run inference with Falcon-180B.
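As a rough sanity check on those numbers, here is a back-of-the-envelope sketch (only the 400GB figure comes from the model card; the overhead note is an assumption):

```python
# Back-of-the-envelope weight sizes for a 180B-parameter model at
# different precisions. Activations and KV cache add overhead on top,
# which is why the model card quotes ~400GB for fp16 inference.
PARAMS = 180e9

for name, bits in [("bf16/fp16", 16), ("int8", 8), ("GPTQ 4-bit", 4), ("GPTQ 3-bit", 3)]:
    weights_gb = PARAMS * bits / 8 / 1e9
    print(f"{name:>10}: ~{weights_gb:.0f} GB of weights")

# bf16/fp16:  ~360 GB -> plus runtime overhead, roughly the 400 GB above
# int8:       ~180 GB
# GPTQ 4-bit:  ~90 GB -> tight on 4 x 24 GB cards once overhead is added
# GPTQ 3-bit:  ~68 GB -> should fit on 4 x 24 GB
```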
It's hard to train a 180B model to unleash its potential. A lot of resources are required, though.
No, it should work - you only need enough RAM to load each piece into RAM before it goes to VRAM. And the model is now sharded (split into multiple smaller files).
Each piece is only 10GB, so in theory you only need 10GB RAM + whatever overhead there is.
As for 4 x 24GB - that won't be enough to load the 4-bit, but should be enough to load the 3-bit.
Give it a try @Pourfard and let us know! So far I've only tested it on 2 x A100 80GB and 6 x L40 48GB.
Note: the model has to be loaded with Transformers directly (not AutoGPTQ), or Text Generation Inference. Loading with AutoGPTQ, or clients that use AutoGPTQ, won't currently work due to the sharding. If you're using text-generation-webui, it should work using the Transformers loader, though I've not tested that yet myself.
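For anyone trying this, a minimal sketch of the Transformers-loader path described above (the repo ID is assumed from context, and this is a sketch rather than a tested command):

```python
# Minimal sketch: loading a sharded GPTQ checkpoint through Transformers
# directly (requires optimum and auto-gptq to be installed). The repo ID
# below is an assumption from context; pick the branch/bit-width you
# actually downloaded.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Falcon-180B-GPTQ"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # streams each ~10GB shard onto the GPUs in turn
)

prompt = "Falcon 180B is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```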
The 2048 sequence length is also disappointing. And I don't think RoPE scaling works with Falcon yet (though I might be wrong; I haven't checked whether it was added in Transformers 4.33.0).
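If RoPE scaling does land for Falcon, usage would presumably mirror the rope_scaling config that Llama models already accept in Transformers; this is an unverified sketch, not a confirmed Falcon feature:

```python
# Unverified sketch: assumes Falcon accepts the same rope_scaling dict
# that Llama models take in Transformers. Not checked against 4.33.0,
# so treat the kwarg and its support as assumptions.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("tiiuae/falcon-180B")
# Linear scaling by 2.0 would stretch the 2048 context to ~4096 tokens.
config.rope_scaling = {"type": "linear", "factor": 2.0}

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B",
    config=config,
    device_map="auto",
)
```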
Falcon was promising a few months ago (and maybe will be again in the future), but now it seems obsolete compared to the Llama 2 variations... 70B Llama 2 easily beats the 180B model!
Meta said they are going to release Llama 3 in the near future and that it should be as powerful as GPT-4, so... not to mention they also said they have already started working on Llama 4.
Base Llama 2 was already beaten by this 180B model. The 70B Llama 2 variations are fine-tuned models, while this 180B Falcon is a barebones base model; once it gets more training and fine-tuning, you will see how capable it actually is.
But how would you run it? You need insane amounts of RAM to even run this model, let alone fine-tune it.
I will just use an HP ProLiant DL360 Gen10 server with 400GB of RAM. I don't have that much RAM yet, but I will add it in the near future. We could also use a swap file to increase the available memory, though with extremely slow performance.
What was your tokens/s on the 180B on the 2x A100s?