
Model Card for DiVA Llama 3

This is an ablation of our Distilled Voice Assistant (DiVA) model, which can handle both speech and text as input. This ablation is trained using only the distillation loss, as described in the ablations of https://huggingface.co/papers/2410.02678
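
Since the model accepts raw speech, a minimal usage sketch is shown below. This is an assumption-laden example rather than an official snippet: the repository ID is a placeholder for this ablation's checkpoint, and the generate() call taking a list of audio arrays mirrors the trust_remote_code interface of the main DiVA releases.

    import librosa
    from transformers import AutoModel

    # Placeholder repo ID -- substitute the actual checkpoint for this ablation.
    model = AutoModel.from_pretrained(
        "WillHeld/DiVA-llama-3-distillation-only", trust_remote_code=True
    )

    # DiVA consumes 16 kHz mono audio; librosa resamples on load.
    speech, _ = librosa.load("question.wav", sr=16_000)

    # Assumed interface: generate() takes a batch (list) of audio arrays
    # and returns the text responses.
    print(model.generate([speech]))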

Weights and Biases Run: https://wandb.ai/i18nlp/DiVA%20Training%20Runs/runs/8i1dd47i?nw=nwuserheld

Citation

This is the distillation-only model from https://huggingface.co/papers/2410.02678. BibTeX:

    @misc{held2024diva,
      author    = {Held, Will and Zhang, Yanzhe and Ryan, Michael and Shi, Weiyan and Li, Ella and Yang, Diyi},
      title     = {Distilling an End-to-End Voice Assistant from Speech Recognition Data},
      year      = {2024},
      publisher = {HuggingFace},
    }
    

Training Details

Training Data

This model was trained on the CommonVoice corpus.
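
To inspect the kind of data involved, the sketch below streams CommonVoice audio at the 16 kHz rate a speech encoder typically expects. The corpus version ("common_voice_17_0") and language subset ("en") are assumptions, since this card does not specify them, and the dataset requires accepting Mozilla's terms on the Hub.

    from datasets import Audio, load_dataset

    # Version and language are assumptions; the card only says "CommonVoice".
    cv = load_dataset(
        "mozilla-foundation/common_voice_17_0", "en",
        split="train", streaming=True,
    )
    cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

    sample = next(iter(cv))
    print(sample["sentence"], sample["audio"]["array"].shape)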

Training Procedure

This model was trained for 7k gradient steps with a batch size of 512 recordings. The learning rate warmed up linearly to 5e-5 over the first 70 steps, then decayed linearly to zero.
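
For concreteness, here is a minimal sketch of that schedule written with optax (Levanter is JAX-based). It assumes the decay spans the remaining 7,000 - 70 steps; the actual Levanter configuration may differ.

    import optax

    peak_lr = 5e-5
    warmup_steps = 70
    total_steps = 7_000

    # Linear warmup from 0 to the peak, then linear decay back to 0.
    schedule = optax.join_schedules(
        schedules=[
            optax.linear_schedule(0.0, peak_lr, transition_steps=warmup_steps),
            optax.linear_schedule(
                peak_lr, 0.0, transition_steps=total_steps - warmup_steps
            ),
        ],
        boundaries=[warmup_steps],
    )

    print(schedule(0), schedule(warmup_steps), schedule(total_steps))  # 0.0, 5e-5, 0.0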

Environmental Impact

  • Hardware Type: TPU v4-32
  • Hours used: 8
  • Cloud Provider: Google Cloud
  • Compute Region: US Central C

Hardware

This model was trained on a TPU v4 on Google Cloud.

Software

This model was trained with Levanter (https://github.com/stanford-crfm/levanter).

Model Card Authors

Will Held

Model Card Contact

[email protected]