Model Card for WaLa-SK-1B

This model is part of the Wavelet Latent Diffusion (WaLa) paper. It generates high-quality 3D shapes with detailed geometry and complex structures from sketch inputs.

Model Details

Model Description

WaLa-SK-1B is a large-scale 3D generative model trained on a dataset of over 10 million publicly available 3D shapes. It generates a wide range of high-quality 3D shapes from sketch inputs in 2-4 seconds. The model combines a compact wavelet-based latent encoding with a billion-parameter architecture to improve geometric detail and structural plausibility.

Developed by: Aditya Sanghi, Aliasghar Khani, Chinthala Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani
Model type: 3D Generative Model
License: Autodesk Non-Commercial (3D Generative) v1.0

For more information, please see the Project Page and the paper.

Model Sources

Project Page: WaLa
Repository: Github
Paper: ArXiv
Demo: Colab

Uses

Direct Use

This model is released by Autodesk and is intended for academic and research purposes only, specifically the theoretical exploration and demonstration of the WaLa 3D generative framework. Please see here for inference instructions.

Out-of-Scope Use

The model should not be used for:

Commercial purposes
Creation of load-bearing physical objects the failure of which could cause property damage or personal injury
Any usage not in compliance with the license, in particular, the "Acceptable Use" section.

Bias, Risks, and Limitations

Bias

The model may inherit biases present in the publicly available training datasets, which could lead to uneven representation of certain object types or styles.
The model's performance may degrade for object categories or styles that are underrepresented in the training data.

Risks and Limitations

The quality of the generated 3D output may be impacted by the quality and clarity of the input.
The model may occasionally generate implausible shapes, especially when the input is ambiguous or of low quality. Even theoretically plausible shapes should not be relied upon for real-world structural soundness.

How to Get Started with the Model

Please refer to the instructions here.

Training Details

Training Data

The model was initially trained on the same dataset as the single-view model, consisting of over 10 million 3D shapes from 19 different publicly available sub-datasets. It was then fine-tuned on synthetic sketch data generated with 6 different techniques.

Training Procedure

Preprocessing

Each 3D shape in the dataset was converted into a truncated signed distance function (TSDF) with a resolution of 256³. The TSDF was then decomposed using a discrete wavelet transform to create the wavelet-tree representation used by the model. Sketches were generated using various techniques including Grease Pencil, Canny edge detection, HED, and CLIPasso.
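
The TSDF and wavelet steps described above can be illustrated with a small sketch. This is not the WaLa preprocessing code. It assumes an analytic sphere in place of a real mesh, a 32³ grid instead of 256³ so it runs quickly, and a single-level orthonormal Haar wavelet as a stand-in, since this card does not state which wavelet filter or how many levels the model uses. The function names (sphere_tsdf, haar_step, haar_dwt3d) are hypothetical.

```python
import numpy as np

def sphere_tsdf(resolution=32, radius=0.5, truncation=0.1):
    """Truncated signed distance to a sphere, sampled on a cubic grid over [-1, 1]^3."""
    axis = np.linspace(-1.0, 1.0, resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    sdf = np.sqrt(x**2 + y**2 + z**2) - radius  # negative inside, positive outside
    return np.clip(sdf, -truncation, truncation)  # truncate far from the surface

def haar_step(volume, axis):
    """One Haar analysis step along one axis: returns (low-pass, high-pass) halves."""
    even = np.take(volume, np.arange(0, volume.shape[axis], 2), axis=axis)
    odd = np.take(volume, np.arange(1, volume.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_dwt3d(volume):
    """Single-level separable 3D Haar transform: one coarse band plus seven detail bands."""
    bands = {"": volume}
    for axis in range(3):
        split = {}
        for name, data in bands.items():
            low, high = haar_step(data, axis)
            split[name + "L"] = low
            split[name + "H"] = high
        bands = split
    return bands

tsdf = sphere_tsdf()
bands = haar_dwt3d(tsdf)
print(tsdf.shape, bands["LLL"].shape, len(bands))  # (32, 32, 32) (16, 16, 16) 8
```

The "LLL" band is a half-resolution approximation of the shape, and the other seven bands hold the detail needed to reconstruct the full grid. Applying the same step again to the coarse band would give a multi-level, tree-like decomposition.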

Training Hyperparameters

Training regime: Please refer to the paper.

Speeds, Sizes, Times

The model contains approximately 956 million parameters.
The model can generate shapes within 2-4 seconds.
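
As a rough guide to what the parameter count means for memory, the following back-of-the-envelope sketch estimates the size of the weights alone. It assumes the approximate figure of 956 million parameters given above and two common storage precisions. The precision of the released checkpoint is not stated in this card, and activations and other runtime buffers are not included.

```python
PARAM_COUNT = 956_000_000  # approximate figure from this model card

# Assumed storage precisions; the actual checkpoint precision is not stated here.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gib(param_count, bytes_per_param):
    """Memory needed to hold the weights only, in GiB (2**30 bytes)."""
    return param_count * bytes_per_param / 2**30

for dtype, size in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_memory_gib(PARAM_COUNT, size):.2f} GiB")
# float32: 3.56 GiB
# float16: 1.78 GiB
```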

Technical Specifications

Model Architecture and Objective

The model uses a U-ViT architecture with modifications. It employs a wavelet-based compact latent encoding to effectively capture both coarse and fine details of 3D shapes from sketch inputs.

Compute Infrastructure

Hardware

The model was trained on NVIDIA H100 GPUs.

Citation

@misc{sanghi2024waveletlatentdiffusionwala,
      title={Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings}, 
      author={Aditya Sanghi and Aliasghar Khani and Pradyumna Reddy and Arianna Rampini and Derek Cheung and Kamal Rahimi Malekshan and Kanika Madan and Hooman Shayani},
      year={2024},
      eprint={2411.08017},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.08017}, 
}