How to deploy?
I know this might sound ridiculous to somebody who works with models on a daily basis, but I am a bloody beginner in machine learning, AI and such, with a somewhat okay foundation in Python, and was wondering how to deploy your models.
Are there usually other files for the tokenizer etc. with the model that are missing here?
Or if you could maybe point me to a good source of information on how to deploy Llama 2 models locally using Python, I would very much appreciate it. I am struggling to find a good entry point into AI models and how to deploy, train and finetune them.
Either the information sources are so advanced that I only understand gibberish and a lot of knowledge is assumed, or it is dumbed down so much that it doesn't really help (pre-chewed solutions like web UIs and one-click solutions are not what I am looking for).
I know this might be the wrong place to ask, but I really would love to deploy this particular model locally within Python. The goal is to create an STT > conversational model > TTS pipeline allowing me to communicate with my personal AI assistant running locally on my GPU. I got the STT running; now I am looking into how to deploy a slim model locally.
Thank you for your time <3 and sorry for the inconvenience caused by my stupid questions!
There are two Python libraries for running inference from GGML files: llama-cpp-python and ctransformers. ctransformers is intended to be very similar to Hugging Face's Transformers library, and therefore supports directly downloading a model file from Hugging Face Hub and immediately running inference on it. Both support GPU acceleration, although they require that you compile them with the CUDA toolkit to get that.
I've had plans to add example Python code to my GGML READMEs showing how to do Python inference from GGML, but haven't had a chance to yet. But if you follow the links above, or the links in the README to those repos, they have example code there.
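Roughly, inference looks something like the sketches below. These are just illustrations: the repo and model file names are examples (swap in whichever quant file you actually download), and the GPU-layer arguments only take effect if the library was compiled with CUDA support.

```python
from ctransformers import AutoModelForCausalLM

# ctransformers can fetch the chosen quant file from Hugging Face Hub itself.
# Repo and file names here are examples; pick the quant you actually want.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML",
    model_file="llama-2-7b-chat.ggmlv3.q4_K_M.bin",
    model_type="llama",
    gpu_layers=32,  # only has an effect if built with CUDA support
)
print(llm("Tell me a joke about computers.", max_new_tokens=128))
```

```python
from llama_cpp import Llama

# llama-cpp-python: point it at a GGML file you've already downloaded.
llm = Llama(
    model_path="./llama-2-7b-chat.ggmlv3.q4_K_M.bin",
    n_gpu_layers=32,  # only has an effect if built with CUDA support
)
out = llm("Q: Tell me a joke about computers. A:", max_tokens=128)
print(out["choices"][0]["text"])
```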
No, there are no files missing here; no other files are needed. With GGML, the tokenizer is built into the GGML file. The files in this repo are a selection of different quant formats. You pick one according to your hardware and your desired compromise between speed and model accuracy: the bigger the file, the better the accuracy, but the more hardware required and the slower the inference. Though as this is a 7B model, none of them will be 'slow'.
Note that although you may not want a UI, text-generation-webui does provide an API which can be called from Python (or any other language). So that's often a way people get going quickly with these models. You can use text-generation-webui for the model loading, meaning you can use GPTQ or GGML models (GPTQ being faster if you have the hardware, GGML being capable of running on smaller hardware, e.g. in situations where you don't have enough VRAM for the model size you want to run).
I provide a Runpod template that can be launched on any of their cloud GPUs, and provides an externally accessible API link that could be used from whatever code: https://github.com/TheBlokeAI/dockerLLM
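As a sketch of what calling that API from Python might look like: the endpoint and payload below assume the legacy blocking API extension on its default port, and the webui API has changed over time, so check its docs if this doesn't match your version. For the Runpod template you'd swap localhost for the URL it gives you.

```python
import requests

# Assumes text-generation-webui was started with its API extension enabled
# (e.g. --api) and is reachable on the default port. For the Runpod template,
# replace localhost:5000 with the externally accessible URL it provides.
resp = requests.post(
    "http://localhost:5000/api/v1/generate",
    json={
        "prompt": "You are a helpful assistant.\nUSER: Hello!\nASSISTANT:",
        "max_new_tokens": 200,
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```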
As for tutorials to quickly get you started with inference, there's a lot on YouTube; YouTubers like AIEntrepreneur and Matthew Berman are a good place to start. That will be UI-based inference, but as mentioned, once you have inference working in general you could then use text-generation-webui as an API server.
For an introduction to pure Python inference, check out the videos of Sam Witteveen. He doesn't use GGML, but he does show Python code (via shared Google Colab notebooks) for doing Transformers inference. That will give you a grounding in how it works, and you could run the notebooks on Google Colab for free before moving on to trying it yourself with quantised models like these GGMLs, or GPTQs.
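For reference, plain Transformers inference (the kind of thing those notebooks walk through) looks roughly like the sketch below. The model name is just an example (the official Llama 2 repos on the Hub are gated), and unquantised 7B weights need far more VRAM than a GGML quant, which is why Colab notebooks often use smaller models or 8-bit loading.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model id only; any causal LM repo you have access to works the same
# way. device_map="auto" needs the accelerate package installed.
model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```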
Thank you so much for all the valuable info and the time taken to show me the way!
After my return from my Kenting vacation I will get right to it! Now I should have all I need to get started thanks to your great work and help!
I guess I will go with the GGML approach, since I was planning on running it on my RTX 3070 Ti laptop GPU and will need some VRAM for the TTS (maybe Bark or TorToiSe, not sure which one produces a good result at low VRAM cost) and STT (Whisper-base). The reason I wanted to avoid hosting it anywhere other than my laptop was privacy when using an uncensored model; it can churn out quite illegal content. The basic idea was to have a kind of personalized AI "friend"/assistant/waifu or antagonist. No need for it to be really capable of anything but talking BS with me, and maybe performing some Google searches and writing stuff into text files. It's rather meant as a gimmick, something that entertains me from time to time while doing my Blender and Stable Diffusion work, and in the future maybe the option to finetune it on the chat history so it becomes more personalized.
I was planning on using VTuber software or some small Unity thing in the making to give it a rigged and animated character with facial shape keys. If I had a more powerful GPU, I would also try to run sentiment analysis on my input to incorporate a more emotional response from the AI, but for now, having a small AI bugger to joke around with is already the peak of my expectations. I bet even an inferior model would actually have done the trick, and even being able to have it retain a specific persona without changing back into a boring AI servant would already be awesome. Wish I had found my interest in AI a couple of years ago; learning to work with this technology from the early days would've made things a bit easier. Feels like I could fill a dictionary with all the variety that is already out there :D
Again, thank you for your help! Stay healthy and take care!