Model Card for TinyLlama-abs2qa

This model was an experiment to see if I could get a model to generate useful questions from a scientific paper's abstract. The answer was yes!

Model Details

The base model is TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T. Thanks to the TinyLlama devs for training and releasing it!

As such, it has a context size of 2048 tokens.

Training data was a modified form of the QASPER train split, which contains 1169 examples of abstracts and suitable questions for NLP papers.

Model Description

I modified the QASPER dataset a little for this training. The original pairs each paper's abstract with a set of questions and their answers. For this test I only wanted to see if I could generate questions from abstracts, so I extracted only the abstracts and questions and formulated them as an Alpaca-style instruction:

{"instruction":"Here is the the abstract for a scientific paper:
  It has been shown that word embeddings derived from large corpora 
  tend to incorporate biases present in their training data. Various 
  methods for mitigating these biases have been proposed, but recent 
  work has demonstrated that these methods hide but fail to truly 
  remove the biases, which can still be observed in word 
  nearest-neighbor statistics. In this work we propose a probabilistic
  view of word embedding bias. We leverage this framework to present a 
  novel method for mitigating bias which relies on probabilistic 
  observations to yield a more robust bias mitigation algorithm. 
  We demonstrate that this method effectively reduces bias according 
  to three separate measures of bias while maintaining embedding quality 
  across various popular benchmark semantic tasks
What would be some questions that the paper could answer?",
"output":"How is embedding quality assessed?
  What are the three measures of bias which are reduced in experiments?
  What are the probabilistic observations which contribute to the more robust algorithm?"}

I'm not sure how critical the instruction phrasing is, but with the instructions as in the training, this tiny model actually does a pretty good job on totally unseen abstracts in NLP.
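The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the exact script used for training: the function name `to_alpaca` and the sample `paper` record are hypothetical, and the record is only assumed to be shaped like an entry in the raw QASPER JSON (an `abstract` string plus a `qas` list whose items carry a `question`). The instruction wording is copied from the example above.

```python
import json

# Wording copied verbatim from the example in this card (including the
# doubled "the"), on the assumption that this is the phrasing used in training.
PROMPT_HEAD = "Here is the the abstract for a scientific paper:"
PROMPT_TAIL = "What would be some questions that the paper could answer?"


def to_alpaca(abstract, questions):
    """Build one Alpaca-style record from an abstract and its questions."""
    instruction = f"{PROMPT_HEAD}\n{abstract.strip()}\n{PROMPT_TAIL}"
    # One question per line, skipping empty entries.
    output = "\n".join(q.strip() for q in questions if q.strip())
    return {"instruction": instruction, "output": output}


# Hypothetical record, assumed to resemble one entry of the raw QASPER JSON.
paper = {
    "abstract": "We propose a probabilistic view of word embedding bias.",
    "qas": [
        {"question": "How is embedding quality assessed?"},
        {"question": "What are the three measures of bias?"},
    ],
}

record = to_alpaca(paper["abstract"], [qa["question"] for qa in paper["qas"]])
print(json.dumps(record))  # one line of a JSONL training file
```

Answers are dropped on purpose here, matching the description above: only the abstract and its questions make it into the training record.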

Training this model with axolotl took only 3 minutes on an A100. Wrangling the environment to get axolotl to work took a lot longer; if you can, I highly recommend using their Docker image.

Developed by: Andrew Green
Model type: Llama 2 architecture, 1.1B parameters
Language(s) (NLP): English
License: Apache 2.0
Finetuned from model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T

Uses

I intend to use this model or a derivative of it to screen papers for inclusion in literature summarisation tools in the future.

Another thing I want to try is using this model to augment QASPER for other fields.

Since it is so fast to train, I think it will also be a useful testbed for trying out some other techniques like DPO and SPIN that I want to learn.

Direct Use

Directly using this model should be possible, though the impact of slightly different prompting styles still needs testing. I also think it will generate ad infinitum, because I didn't use a chat template for training. Fixing that is on my to-do list and should be quick enough.

From a few quick tests, the generated questions look at least plausible, though they may be of questionable utility in the real world.
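A rough sketch of what direct use might look like, under several assumptions that I have not confirmed: that the repository id is `afg1/tiny-llama-abs2qa`, that training used the standard Alpaca no-input template, and that the instruction wording matches the training example. The helper names are hypothetical. Because generation may not stop on its own, the sketch caps `max_new_tokens` and then keeps only the first few distinct lines that end in a question mark.

```python
# Wording copied from the training example in this card (doubled "the" included).
PROMPT_HEAD = "Here is the the abstract for a scientific paper:"
PROMPT_TAIL = "What would be some questions that the paper could answer?"


def build_prompt(abstract):
    """Wrap an abstract in the instruction wording used for training."""
    instruction = f"{PROMPT_HEAD}\n{abstract.strip()}\n{PROMPT_TAIL}"
    # Assumption: the standard Alpaca no-input template; adjust if training differed.
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )


def extract_questions(generated, limit=5):
    """Keep the first `limit` distinct lines that end in a question mark."""
    questions = []
    for line in generated.splitlines():
        line = line.strip()
        if line.endswith("?") and line not in questions:
            questions.append(line)
        if len(questions) == limit:
            break
    return questions


def generate_questions(abstract, model_id="afg1/tiny-llama-abs2qa", max_new_tokens=128):
    """Run the model once; max_new_tokens bounds the output if it never stops."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    result = generator(
        build_prompt(abstract),
        max_new_tokens=max_new_tokens,
        do_sample=False,          # greedy decoding for repeatable output
        return_full_text=False,   # return only the continuation, not the prompt
    )
    return extract_questions(result[0]["generated_text"])
```

Calling `generate_questions(abstract_text)` downloads the model on first use. The filtering step is a stopgap for runaway generation, not a substitute for training with a proper end-of-sequence token or chat template.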

Out-of-Scope Use

The model was finetuned on abstracts of NLP papers and on questions about those papers written by NLP experts. As such, it is quite likely that the model will not work well in other fields. In my limited testing, however, it does seem to generalise OK.

The same risks of misuse and malicious use apply as they would for any LLM. In particular, because this model generates questions from an abstract, it could be misused in academia (e.g. to partially automate peer review). I think this would violate most publishers' terms.

Bias, Risks, and Limitations

This model is based on TinyLlama, which is a foundation model, so all the same risks of out-of-scope use apply here.

The model is biased towards NLP abstracts, because those are what the QASPER dataset it was trained on contains.

This is a very small model, so it is likely to be quite limited in its reasoning capabilities, which may lead to nonsense or irrelevant questions being generated.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
