arxiv:2309.05447

TeGit: Generating High-Quality Instruction-Tuning Data with Text-Grounded Task Design

Published on Sep 11, 2023
Authors:

Abstract

High-quality instruction-tuning data is critical to improving LLM capabilities. Existing data collection methods are limited either by unrealistic manual labeling costs or by the hallucination that arises from relying solely on LLM generation. To address these problems, this paper presents a scalable method for automatically collecting high-quality instruction-tuning data by training language models to design tasks based on human-written texts. Intuitively, grounding task generation in human-written text helps the model mitigate hallucination. Unlike instruction-backtranslation-based methods that directly take the given text as the response, we require the model to generate the instruction, input, and output simultaneously in order to filter out noise. Results from both automatic and human evaluation demonstrate the quality of our dataset.
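
As a rough illustration of the text-grounded pipeline the abstract describes, the Python sketch below prompts a task-design model to emit an instruction, input, and output for each human-written text and discards generations that do not form a complete triple. The prompt wording, the `generate_fn` hook, and the JSON-based filtering rule are illustrative assumptions, not the authors' actual TeGit implementation.

```python
# Minimal sketch of text-grounded task design, as described in the abstract.
# All names (PROMPT_TEMPLATE, generate_fn, the filtering rule) are assumptions
# for illustration, not the paper's actual implementation.
import json
from typing import Callable, Iterable, Optional

PROMPT_TEMPLATE = """You are given a human-written text. Design one task grounded in it.
Return JSON with keys "instruction", "input", and "output".

Text:
{text}
"""


def design_task(text: str, generate_fn: Callable[[str], str]) -> Optional[dict]:
    """Ask a task-design model for an (instruction, input, output) triple.

    generate_fn wraps whatever trained language model is used. Returns None
    when the generation cannot be parsed or is incomplete, so noisy texts are
    dropped instead of being kept verbatim as responses.
    """
    raw = generate_fn(PROMPT_TEMPLATE.format(text=text))
    try:
        task = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed generation -> discard

    # Require all three fields; texts that do not yield a coherent task
    # (ads, boilerplate, fragments) are filtered out here.
    if not all(task.get(k) for k in ("instruction", "input", "output")):
        return None
    return task


def build_dataset(texts: Iterable[str], generate_fn: Callable[[str], str]) -> list[dict]:
    """Collect instruction-tuning examples from a corpus of human-written texts."""
    return [t for t in (design_task(x, generate_fn) for x in texts) if t is not None]
```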

