
SandLogic Technologies - Quantized Phi-3.1-mini-4k-instruct Models

Model Description

We have quantized the Phi-3.1-mini-4k-instruct model into three variants:

  1. Q5_K_M
  2. Q4_K_M
  3. IQ4_XS

These quantized models offer improved efficiency while maintaining performance.

Discover our full range of quantized language models by visiting our SandLogic Lexicon GitHub. To learn more about our company and services, check out our website at SandLogic.

Original Model Information

  • Name: Phi-3.1-mini-4k-instruct
  • Developer: Microsoft
  • Model Type: Open-source language model
  • Parameters: 3.8 billion
  • Context Length: 4K (4,096) tokens
  • Training Data: 3.3 trillion tokens, including curated public documents, synthetic "textbook-like" data, and high-quality chat data
  • Language: English

Model Capabilities

The Phi-3.1-mini-4k-instruct model is designed for a variety of commercial and research applications, particularly in environments with limited memory or computational resources, scenarios requiring low latency, and tasks involving robust reasoning capabilities, such as mathematics and logic.

The model's key capabilities include:

  1. Instruction following
  2. Structured output generation
  3. High-quality multi-turn conversations
  4. Explicit support for the <|system|> tag
  5. Improved reasoning capabilities

Use Cases

  1. Environments with Limited Resources: Suitable for deployment on devices with limited memory or computational power, such as laptops, desktops, or edge devices.
  2. Low-Latency Applications: Ideal for use cases where quick responses are critical, such as customer service chatbots or real-time text generation.
  3. Mathematics and Logic-Based Tasks: Performs well on tasks requiring robust reasoning capabilities, including math problem-solving and logical inference.
  4. Processing and Analyzing Long-Form Text: Able to handle and analyze text efficiently within its 4K-token context window.

Model Variants

We offer three quantized versions of the Phi-3.1-mini-4k-instruct model:

  1. Q5_K_M: 5-bit quantization using the llama.cpp K-quant "medium" method (highest quality of the three, largest files)
  2. Q4_K_M: 4-bit quantization using the K-quant "medium" method (good balance of size and quality)
  3. IQ4_XS: 4-bit quantization using the importance-matrix "extra small" method (smallest files)

These quantized models aim to reduce model size and improve inference speed while maintaining performance as close to the original model as possible.
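As a rough illustration of the size trade-off, on-disk size scales with bits per weight. The bits-per-weight figures below are approximate llama.cpp values, not measurements of these specific files, so treat the results as ballpark estimates only:

```python
# Rough GGUF size estimates for a ~3.8B-parameter model.
# Bits-per-weight values are approximate llama.cpp figures and vary
# slightly between models; results ignore metadata overhead.
PARAMS = 3.8e9

APPROX_BPW = {
    "Q5_K_M": 5.5,   # 5-bit K-quant, medium
    "Q4_K_M": 4.8,   # 4-bit K-quant, medium
    "IQ4_XS": 4.25,  # 4-bit importance-matrix quant, extra small
}

def approx_size_gib(bits_per_weight: float, n_params: float = PARAMS) -> float:
    """Approximate on-disk size in GiB for a given bits-per-weight."""
    return n_params * bits_per_weight / 8 / 1024**3

for name, bpw in APPROX_BPW.items():
    print(f"{name}: ~{approx_size_gib(bpw):.2f} GiB")
```

The ordering matters more than the exact numbers: IQ4_XS is the smallest and fastest to load, while Q5_K_M stays closest to the original model's quality.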

Input and Output

  • Input: Text string (e.g., instructions, prompts, or long-form text)
  • Output: Generated text following the input, with structured output, improved reasoning, and adherence to the <|system|> tag
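The input layout above can be sketched as a small helper that renders messages into the Phi-3 chat template, including the <|system|> turn. This is a sketch of the documented template; llama-cpp-python can also apply it automatically via its chat API:

```python
def format_phi3_prompt(messages):
    """Render a list of {role, content} dicts into the Phi-3 chat
    template: each turn is wrapped as <|role|>\\n...<|end|>\\n, and the
    prompt ends with <|assistant|> so the model generates the reply."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}<|end|>\n")
    parts.append("<|assistant|>")
    return "".join(parts)

prompt = format_phi3_prompt([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "How to explain Internet to a medieval knight?"},
])
print(prompt)
```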

Usage

pip install llama-cpp-python

Please refer to the llama-cpp-python documentation to install with GPU support.

Basic Text Completion

Here's an example demonstrating how to use the high-level API for basic text completion:

from llama_cpp import Llama


llm = Llama(
  model_path="./Phi-3.1-mini-4k-instruct-Q4_K_M.gguf",  # path to the downloaded GGUF file
  n_ctx=4096,  # Max sequence length; longer contexts require more memory
  n_threads=8,  # Number of CPU threads; tune to your system
  n_gpu_layers=35,  # Layers to offload to GPU; set to 0 if no GPU acceleration is available
)

prompt = "How to explain Internet to a medieval knight?"

# Simple inference example
output = llm(
  f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
  max_tokens=256,  # Generate up to 256 tokens
  stop=["<|end|>"], 
  echo=True,  # Whether to echo the prompt
)

print(output['choices'][0]['text'])
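Instead of hand-building the prompt string, llama-cpp-python's create_chat_completion method accepts OpenAI-style messages and applies the model's chat template from the GGUF metadata. The sketch below assumes the Q5_K_M file has been downloaded locally under the filename shown:

```python
import os

# OpenAI-style messages; the system turn uses the <|system|> support
# described above.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How to explain Internet to a medieval knight?"},
]

# Assumed local filename; adjust to wherever you saved the GGUF file.
MODEL_PATH = "./Phi-3.1-mini-4k-instruct-Q5_K_M.gguf"

if os.path.exists(MODEL_PATH):
    from llama_cpp import Llama

    llm = Llama(model_path=MODEL_PATH, n_ctx=4096, verbose=False)
    result = llm.create_chat_completion(messages=messages, max_tokens=256)
    print(result["choices"][0]["message"]["content"])
```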

Download

You can download the quantized models in GGUF format directly from Hugging Face using the from_pretrained method. This feature requires the huggingface-hub package.

To install it, run: pip install huggingface-hub

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="SandLogicTechnologies/Phi-3.1-mini-4k-instruct-GGUF",
    filename="*Phi-3.1-mini-4k-instruct-Q5_K_M.gguf",
    verbose=False
)

By default, from_pretrained will download the model to the Hugging Face cache directory. You can manage installed model files using the huggingface-cli tool.
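To inspect what from_pretrained has cached, huggingface_hub also exposes scan_cache_dir from Python. The sketch below assumes huggingface-hub is installed; the human_size helper is just a hypothetical formatting convenience:

```python
def human_size(num_bytes):
    """Format a byte count as B/KiB/MiB/GiB for readability."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if num_bytes < 1024 or unit == "TiB":
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024

try:
    from huggingface_hub import scan_cache_dir

    info = scan_cache_dir()  # scans the default Hugging Face cache directory
    print(f"Cache total: {human_size(info.size_on_disk)}")
    for repo in info.repos:
        print(f"{repo.repo_id}: {human_size(repo.size_on_disk)}")
except ImportError:
    print("huggingface-hub is not installed")
```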

License

These quantized models inherit the license of the original Phi-3-mini-4k-instruct release; see Microsoft's Phi-3 model card for the license terms.

Acknowledgements

We thank the Microsoft team for developing and releasing the original Phi-3.1-mini-4k-instruct model. Special thanks to Georgi Gerganov and the entire llama.cpp development team for their outstanding contributions.

Contact

For any inquiries or support, please contact us at [email protected] or visit our support page.
