Understanding CoreML conversion of llama 2 7b
Could you kindly provide more details on the hardware used and the conversion process, in a blog/guide style? Many community members could benefit from the learnings.
+1
I'll publish a guide focused on conversion in a few days!
In addition, we need to provide some of the pieces required to perform text generation with the converted model: tokenizers, text generation strategies, etc. Working on it!
That would be amazing! TIA.
Thanks for the work
Hi! Thanks a lot for your work! Where can the guide be found?
Amazing work! Is there a variant of the diffusers app adapted for querying llama2?
In case you didn't see it, we published swift-transformers and this post a couple of weeks ago: https://huggingface.co/blog/swift-coreml-llm
Please, let us know if that's helpful, or if you'd like us to dive in more depth on any of the topics :)
Complete noob here, but would it be possible to show how to run the Core ML model? I am attempting to build a stock app that can process news and give a summary on a stock, but when I load the model, it requires an attention mask, and the inputs are in the form of an integer array. I'm not sure how to use the tokenizer with Core ML for it.
Thanks for sharing!
@Ovats Perhaps this section in the blog post could help! It covers how to do tokenization in Swift with swift-transformers.
import Tokenizers

func testTokenizer() async throws {
    // Downloads the tokenizer files that ship with the converted model repo
    let tokenizer = try await AutoTokenizer.from(pretrained: "pcuenq/Llama-2-7b-chat-coreml")
    // Calling the tokenizer returns the token ids for the text
    let inputIds = tokenizer("Today she took a train to the West")
    assert(inputIds == [1, 20628, 1183, 3614, 263, 7945, 304, 278, 3122])
}
The swift-transformers library is still new though, and @pcuenq will be making improvements to it to make it even easier! Perhaps he can add some extra context here too.
Hi @Ovats !
The swift-transformers library will deal with many of those details automatically. I would recommend you take a look at the swift-chat example app, which simply calls generate with a prompt and a configuration object, and swift-transformers will do the rest. Under the hood, it will:
- Tokenize the prompt, using code similar to what @Xenova posted above.
- Invoke the model repeatedly, because language models produce one token at a time. For example, the greedySearch generation method uses a loop to get the most probable token each time, and it appends it to the output.
- Prepare a suitable attention mask when necessary (not all models require it).
Please, let us know if that helps!
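To make the steps listed above more concrete, here is a minimal sketch of a greedy generation loop in plain Swift. It is not the swift-transformers implementation: the `NextTokenLogits` closure, `greedyGenerate`, and `toyModel` are hypothetical names standing in for the converted model, and the attention mask is simply all ones because the sketch never pads. A converted model with a fixed input length would also need padding, which is left out here.

```swift
// Hypothetical stand-in for the converted model: given token ids and an
// attention mask, it returns one score (logit) per vocabulary entry for the next token.
typealias NextTokenLogits = (_ inputIds: [Int], _ attentionMask: [Int]) -> [Float]

/// Index of the largest value; greedy search picks this token at every step.
func argmax(_ values: [Float]) -> Int {
    precondition(!values.isEmpty, "logits must not be empty")
    var bestIndex = 0
    for (index, value) in values.enumerated() where value > values[bestIndex] {
        bestIndex = index
    }
    return bestIndex
}

/// Greedy decoding: invoke the model repeatedly and append the most probable token each time.
func greedyGenerate(
    promptIds: [Int],
    maxNewTokens: Int,
    eosTokenId: Int,
    nextTokenLogits: NextTokenLogits
) -> [Int] {
    var tokens = promptIds
    for _ in 0..<maxNewTokens {
        // Nothing is padded in this sketch, so every position gets a 1 in the mask.
        let attentionMask = [Int](repeating: 1, count: tokens.count)
        let logits = nextTokenLogits(tokens, attentionMask)
        let next = argmax(logits)
        tokens.append(next)
        // Stop early once the end-of-sequence token is produced.
        if next == eosTokenId { break }
    }
    return tokens
}

// Toy "model" with a vocabulary of 8 tokens: it always prefers (last token + 1) % 8.
let toyModel: NextTokenLogits = { inputIds, _ in
    var logits = [Float](repeating: 0, count: 8)
    logits[((inputIds.last ?? 0) + 1) % 8] = 1
    return logits
}

let output = greedyGenerate(promptIds: [1, 2], maxNewTokens: 10, eosTokenId: 5, nextTokenLogits: toyModel)
print(output)
```

In a real app the closure would wrap a Core ML prediction call, and the resulting ids would be decoded back to text with the tokenizer.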
How can we use this model in swift-chat and target the ANE?
There's a great ANE repo here that discusses ways to get a model running on the ANE, but it doesn't appear to be guaranteed. Whether a model actually uses the ANE is largely a black box. But you can try!
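As a rough illustration of "you can try": Core ML lets you state a compute-unit preference when loading a model. This is a sketch using the plain Core ML API, not something specific to swift-chat; the helper names `makeANEPreferringConfiguration` and `loadModel(at:)` are hypothetical, and the URL is assumed to point at a compiled model. The setting is only a preference, so Core ML may still run some or all layers on the CPU.

```swift
import CoreML
import Foundation

/// Builds a configuration asking Core ML to use the CPU and the Neural Engine.
/// `.cpuAndNeuralEngine` requires macOS 13 / iOS 16 or later.
@available(macOS 13.0, iOS 16.0, *)
func makeANEPreferringConfiguration() -> MLModelConfiguration {
    let configuration = MLModelConfiguration()
    configuration.computeUnits = .cpuAndNeuralEngine
    return configuration
}

/// Loads a compiled model (.mlmodelc) with the configuration above.
/// `compiledModelURL` is an assumption: it must point at a model you compiled yourself.
@available(macOS 13.0, iOS 16.0, *)
func loadModel(at compiledModelURL: URL) throws -> MLModel {
    try MLModel(contentsOf: compiledModelURL, configuration: makeANEPreferringConfiguration())
}
```

Whether the ANE is actually used has to be checked separately, for example by profiling the app while it runs.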
@pcuenq do you have the code you used to convert the llama2-hf model to coreml? Or scripts?
Currently getting stuck here: https://github.com/huggingface/exporters/issues/76 (as well as any Llama2 based model)