metadata

language:
  - eng
tags:
  - llama-2
  - sft
license:
  - mit
datasets:
  - LDJnr/Capybara
  - LDJnr/LessWrong-Amplify-Instruct
  - LDJnr/Pure-Dove
  - LDJnr/Verified-Camel

Nous-Capybara-7B-GGUF

A model created with a novel synthesis method in mind, Amplify-Instruct, with the goal of synergistically combining techniques used for SOTA models such as Evol-Instruct, Orca, Vicuna, Lamini, FLASK and others into one lean, holistically formed dataset and model. The seed instructions that start the synthesized conversations are largely based on highly acclaimed datasets like Airoboros, Know logic, EverythingLM and GPTeacher, along with entirely new seed instructions derived from posts on the website LessWrong, and are supplemented with certain multi-turn datasets like Dove (a successor to Puffin).

The dataset is entirely contained within 20K training examples, mostly composed of newly synthesized tokens never used for model training until now!

Process of creation and special thank yous!

This model was fine-tuned by Nous Research, with LDJ leading the training and dataset curation, along with significant dataset formation contributions by J-Supha. Thank you also to Emozilla for helping to expedite the training experimentation process.

Special thank you to A16Z for sponsoring our training, as well as Yield Protocol for their support in resources during R&D of aspects outside of training, such as dataset development/synthesis.

Thank you to dataset creators!

While most of the tokens within Capybara are newly synthesized and part of datasets like Puffin/Dove, we would like to credit the single-turn datasets we leveraged as seeds to initiate the beginning of many of the multi-turn conversations:

Model Training

Nous-Capybara 7B is a new model trained for multiple epochs on a dataset of fewer than 20,000 carefully curated GPT-4 examples, most of which are long-context conversations between a real human and GPT-4, composed entirely of newly synthesized tokens that previously didn't exist on Hugging Face.

Additional data came from manually curated CamelAI data, with the help of volunteers including former physicists, mathematicians, biologists and more!

Specific credits to the people involved in validating this data will be posted soon :)

Prompt Format

The recommended model usage is:

USER:

ASSISTANT:
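As a minimal sketch of how a prompt in this format might be assembled in Python: the helper name `build_prompt` is hypothetical, and the single space after each colon and the newline between turns are assumptions, since the card only specifies the two role labels.

```python
def build_prompt(turns, next_user_message):
    """Format a conversation in the USER:/ASSISTANT: style.

    turns: list of (user_text, assistant_text) pairs from earlier turns,
    oldest first. Spacing and newlines are assumptions, not something
    the model card specifies.
    """
    parts = []
    for user_text, assistant_text in turns:
        parts.append(f"USER: {user_text}")
        parts.append(f"ASSISTANT: {assistant_text}")
    parts.append(f"USER: {next_user_message}")
    # End with a bare "ASSISTANT:" so the model generates the reply.
    parts.append("ASSISTANT:")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt([("Hello!", "Hi, how can I help?")], "Summarize this paper."))
```

The returned string can be passed as the prompt to whichever GGUF-compatible runtime you use.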

Notable Features:

The first Nous model trained on over 10,000 multi-turn conversations.
Over 1,000 tokens average per conversation example during training!
Able to effectively produce complex summaries of advanced studies.
Ability to recall information up to late 2022 without internet access (ChatGPT's cutoff date is in 2021).
Context length of 4096 tokens, and fine-tuned on a significant amount of multi-turn conversations reaching that full token limit.
Includes a portion of conversational data synthesized from LessWrong posts, discussing in depth the nature of rationality, reasoning and self-improvement.
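Because the context length is 4096 tokens, long multi-turn chats eventually need older turns dropped. The sketch below shows one possible way to do that. The function `trim_history`, the `count_tokens` callable and the `reserve_for_reply` default are all hypothetical; in practice `count_tokens` should wrap the tokenizer of your runtime, and summing per-turn counts is only an approximation of the full prompt's token count.

```python
CONTEXT_LENGTH = 4096  # context length stated in the model card


def trim_history(turns, next_user_message, count_tokens, reserve_for_reply=512):
    """Keep the most recent turns that fit in the context window.

    turns: list of (user_text, assistant_text) pairs, oldest first.
    count_tokens: callable str -> int (hypothetical; supply your tokenizer).
    reserve_for_reply: tokens left free for the answer (assumed value).
    """
    budget = CONTEXT_LENGTH - reserve_for_reply
    # Tokens that are always needed: the new user message and the reply cue.
    used = count_tokens(f"USER: {next_user_message}\nASSISTANT:")
    kept = []
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for user_text, assistant_text in reversed(turns):
        cost = count_tokens(f"USER: {user_text}\nASSISTANT: {assistant_text}\n")
        if used + cost > budget:
            break
        used += cost
        kept.append((user_text, assistant_text))
    kept.reverse()  # restore oldest-first order
    return kept


if __name__ == "__main__":
    # Crude stand-in tokenizer: whitespace-separated words.
    words = lambda text: len(text.split())
    history = [("a " * 4000, "long answer"), ("short question", "short answer")]
    print(trim_history(history, "next question", words))
```

Dropping whole turns from the oldest end keeps the remaining conversation coherent; summarizing old turns instead is another option this sketch does not cover.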

Example Outputs!:

Benchmarks! (Note that all mentioned benchmarks are single-turn and don't test multi-turn capabilities; we expect Capybara to do even better at multi-turn conversational tasks.)

Limitations

We noticed that the current version of Capybara still censors itself in some situations and does not act as expected in certain edge cases. We plan to have this largely resolved in the near future with Capybara 1.1.

Future Changes

This is a relatively early build amongst the grand plans for the future of Capybara!

Current limitations: We are still running experiments and tests to refine the training pipeline and dataset cleaning process. We plan to release a Capybara 1.1 with these improvements.

Future model sizes

We plan on releasing 3B, 13B and 70B versions, as well as a potential 1B version based on phi-1.5 or similar architectures.

How you can help!

In the near future we plan on leveraging the help of domain-specific expert volunteers to eliminate any mathematically/verifiably incorrect answers from our training curations.

If you have at least a bachelor's degree in mathematics, physics, biology or chemistry and would like to volunteer even just 30 minutes of your time and expertise, please contact LDJ on Discord!

Dataset contamination.

We checked for 100%, 99%, 98% and 97% similarity matches between our data and many popular benchmarks and found no exact matches!

We checked for contamination against the following benchmarks:

HumanEval
AGIEval
TruthfulQA
MMLU
GPT4All
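The card does not say which similarity metric or tooling was used for the contamination check. As an illustrative sketch only, the snippet below uses Python's standard `difflib.SequenceMatcher` as a stand-in metric and reports training/benchmark pairs at or above a threshold. The function name `find_contamination` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher

# Similarity levels named in the card, as fractions.
THRESHOLDS = (1.00, 0.99, 0.98, 0.97)


def find_contamination(train_texts, benchmark_texts, threshold):
    """Return (train_index, benchmark_index, ratio) for near-duplicate pairs.

    Uses difflib's ratio as a stand-in similarity measure (an assumption;
    the original check may have used something else). Compares every pair,
    so the cost grows with len(train_texts) * len(benchmark_texts).
    """
    hits = []
    for i, train in enumerate(train_texts):
        for j, bench in enumerate(benchmark_texts):
            # autojunk=False avoids difflib's heuristic on long sequences.
            ratio = SequenceMatcher(None, train, bench, autojunk=False).ratio()
            if ratio >= threshold:
                hits.append((i, j, ratio))
    return hits


if __name__ == "__main__":
    train = ["What is 2 + 2?", "Name the capital of France."]
    bench = ["Name the capital of France.", "Describe the Krebs cycle."]
    for threshold in THRESHOLDS:
        print(threshold, find_contamination(train, bench, threshold))
```

For datasets of real size, an all-pairs comparison like this is slow; hashing or n-gram indexing would usually be used to narrow candidates first.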

Citation:

@article{daniele2023amplify-instruct,
  title={Amplify-Instruct: Synthetically Generated Diverse Multi-turn Conversations for Efficient LLM Training.},
  author={Daniele, Luigi and Suphavadeeprasit},
  journal={arXiv preprint arXiv:(coming soon)},
  year={2023}
}