Model Card for DOFA
What is DOFA: DOFA is a unified multimodal foundation model for different data modalities in remote sensing and Earth observation.
Model Details
Differences with existing foundation models: DOFA is pre-trained using five different data modalities in remote sensing and Earth observation. It can handle images with any number of input channels.
DOFA is inspired by neuroplasticity: Neuroplasticity is an important brain mechanism for adjusting to new experiences or environmental shifts. We design DOFA to emulate this mechanism when processing multimodal EO data.
For more details, see the paper "Neural Plasticity-Inspired Foundation Model for Observing the Earth Crossing Modalities".
Model Description
Why develop DOFA
- The learned multimodal representation may not effectively capture the relationships between sensors.
- The performance of foundation models degrades when downstream tasks require data from unseen sensors with different numbers of spectral bands, spatial resolutions, or wavelength regimes.
- Developing individual, customized foundation models requires considerably more computing resources and human effort.
- The increasing number of specialized foundation models makes it difficult to select the most appropriate one for a specific downstream task.
DOFA supports input images with any number of channels using our pre-trained foundation models. The examples in the GitHub repo DOFA show how to use DOFA for Sentinel-1 (SAR), Sentinel-2, and NAIP RGB data. We will add example usage for Gaofen multispectral and hyperspectral data soon.
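To give a rough intuition for how one set of parameters can serve inputs with different channel counts, here is a toy sketch in plain Python. It is not DOFA's implementation and does not use the DOFA API; see the repository for real usage. It assumes that each channel is described by its central wavelength, and that a small generator turns that wavelength into projection weights. The names `encode_wavelength`, `channel_weights`, `embed_pixel`, and the `GENERATOR` matrix are hypothetical, and the wavelengths are approximate Sentinel-2 band centres.

```python
import math

EMBED_DIM = 4  # size of the wavelength encoding (toy value)
OUT_DIM = 3    # size of the output feature vector (toy value)

# Hypothetical fixed "weight generator" (OUT_DIM x EMBED_DIM).
# In a real model these numbers would be learned, not hand-written.
GENERATOR = [
    [0.5, -0.2, 0.1, 0.3],
    [-0.4, 0.6, 0.2, -0.1],
    [0.3, 0.1, -0.5, 0.2],
]


def encode_wavelength(wavelength_um: float) -> list:
    """Sinusoidal encoding of a central wavelength given in micrometres."""
    return [
        fn(wavelength_um * (2.0 ** k))
        for k in range(EMBED_DIM // 2)
        for fn in (math.sin, math.cos)
    ]


def channel_weights(wavelength_um: float) -> list:
    """Generate OUT_DIM projection weights for one channel from its wavelength."""
    enc = encode_wavelength(wavelength_um)
    return [sum(g * e for g, e in zip(row, enc)) for row in GENERATOR]


def embed_pixel(values: list, wavelengths_um: list) -> list:
    """Project one pixel with any number of channels to OUT_DIM features."""
    if len(values) != len(wavelengths_um):
        raise ValueError("one wavelength is required per channel")
    out = [0.0] * OUT_DIM
    for value, wavelength in zip(values, wavelengths_um):
        for j, weight in enumerate(channel_weights(wavelength)):
            out[j] += weight * value
    return out


# Approximate Sentinel-2 band centres in micrometres: B4, B3, B2 and B8.
rgb = embed_pixel([0.2, 0.3, 0.1], [0.665, 0.560, 0.490])
rgbn = embed_pixel([0.2, 0.3, 0.1, 0.6], [0.665, 0.560, 0.490, 0.842])
print(len(rgb), len(rgbn))  # both feature vectors have OUT_DIM entries
```

Because the weights are generated per channel and the contributions are summed, the output size does not depend on how many channels the input has.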
- Developed by: Technical University of Munich, Chair of Data Science in Earth Observation
- Funded by: Ekapex, ML4Earth
- Model type: Multimodal Foundation Model for Remote Sensing and Earth Observation
- License: CC-BY-4.0
Model Sources
- Repository: https://github.com/zhu-xlab/DOFA
- Paper: https://arxiv.org/abs/2403.15356
- Demo: https://github.com/ShadowXZT/DOFA-pytorch/blob/master/demo.ipynb
Table 1: Linear probing results on six classification tasks. All models are trained for 50 epochs. The reported numbers are top-1 overall accuracy (OA). Missing values are due to the inability of the model to adapt to this domain.
Method | Backbone | m-bigearthnet | m-forestnet | m-brick-kiln | m-pv4ger | m-so2sat | m-eurosat |
---|---|---|---|---|---|---|---|
Fully Trained | ViT-S | 66.0 | 53.8 | 98.1 | 97.6 | 57.5 | 97.3 |
Fully Trained | SwinV2-T | 70.0 | 58.0 | 98.7 | 98.0 | 56.1 | 97.4 |
Fully Trained | ConvNext-B | 69.1 | 56.8 | 98.9 | 98.0 | 58.1 | 97.7 |
rand. init. | ViT-B | 52.9 | 41.5 | 84.5 | 91.3 | 38.3 | 85.7 |
MAE_Single [44] | ViT-B | 63.6 | - | 88.9 | 92.2 | 50.0 | 88.9 |
OFA-Net [43] | ViT-B | 65.0 | - | 94.7 | 93.2 | 49.4 | 91.9 |
SatMAE [25] | ViT-B | 62.1 | - | 93.9 | - | 46.9 | 86.4 |
Scale-MAE [22] | ViT-L | - | - | - | 96.9 | - | - |
GFM [21] | Swin-B | - | - | - | 96.8 | - | - |
Cross-Scale MAE [23] | ViT-B | - | - | - | 93.1 | - | - |
FG-MAE [24] | ViT-B | 63.0 | - | 94.7 | - | 51.4 | 87.0 |
CROMA [27] | ViT-B | 67.4 | - | 91.0 | - | 49.2 | 90.1 |
DOFA | ViT-B | 65.7 | 50.9 | 95.8 | 96.9 | 55.1 | 93.9 |
DOFA | ViT-L | 67.5 | 54.6 | 96.9 | 97.3 | 60.1 | 97.1 |
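For readers unfamiliar with the metric in Table 1, top-1 overall accuracy is the percentage of samples whose predicted class matches the label. The sketch below is a minimal illustration on made-up toy labels, not data from the benchmark; the function name `overall_accuracy` is chosen here for illustration only.

```python
def overall_accuracy(y_true, y_pred):
    """Return top-1 overall accuracy in percent for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Toy example: 4 of the 5 predictions match the labels.
labels = [0, 1, 2, 1, 0]
preds = [0, 1, 1, 1, 0]
print(overall_accuracy(labels, preds))  # 80.0
```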
Table 2: Partial fine-tuning results on six segmentation tasks. All models are trained with a frozen backbone for 20 epochs. Reported numbers are mean intersection over union (mIoU). Missing values are due to the inability of the model to adapt to this domain.
Method | Backbone | m-pv4ger-seg | m-nz-cattle | m-NeonTree | m-cashew-plant | m-SA-crop | m-chesapeake |
---|---|---|---|---|---|---|---|
DeepLabv3 | ResNet101 | 93.4 | 67.6 | 53.9 | 48.6 | 30.4 | 62.1 |
U-Net | ResNet101 | 94.1 | 80.5 | 56.6 | 46.6 | 29.9 | 70.8 |
rand. init. | ViT-B | 81.7 | 74.1 | 51.7 | 32.4 | 29.0 | 47.1 |
MAE_Single [44] | ViT-B | 88.4 | 76.4 | 53.0 | 40.7 | 30.7 | 51.9 |
OFA-Net [43] | ViT-B | 89.4 | 77.6 | 53.3 | 47.9 | 31.9 | 54.5 |
Scale-MAE [22] | ViT-L | 83.5 | 76.5 | 51.0 | - | - | 61.0 |
GFM [21] | Swin-B | 92.0 | 75.0 | 51.1 | - | - | 63.8 |
Cross-Scale MAE [23] | ViT-B | 83.2 | 77.9 | 52.1 | - | - | 52.3 |
CROMA [27] | ViT-B | - | - | - | 30.1 | 31.4 | - |
FG-MAE [24] | ViT-B | - | - | - | 40.8 | 30.6 | - |
DOFA | ViT-B | 94.5 | 81.4 | 58.8 | 51.5 | 33.0 | 65.3 |
DOFA | ViT-L | 95.0 | 81.8 | 59.4 | 56.9 | 32.1 | 66.3 |
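The mIoU metric in Table 2 averages, over classes, the ratio between the intersection and the union of the predicted and reference masks. The following sketch computes it on flattened toy masks. It assumes that classes absent from both masks are skipped and that there are no ignored labels; benchmark implementations may differ in these details, and the function name `mean_iou` is illustrative.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Return mean intersection over union in percent for flattened label masks."""
    if len(y_true) != len(y_pred):
        raise ValueError("masks must have the same number of pixels")
    ious = []
    for c in range(num_classes):
        intersection = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        union = sum(t == c or p == c for t, p in zip(y_true, y_pred))
        if union:  # skip classes that appear in neither mask
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class is present in either mask")
    return 100.0 * sum(ious) / len(ious)


# Toy masks with two classes; each class has intersection 2 and union 4.
truth = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
print(mean_iou(truth, pred, num_classes=2))  # 50.0
```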
Uses
Please refer to the GitHub repo DOFA for more details.
@article{xiong2024neural,
title={Neural Plasticity-Inspired Foundation Model for Observing the {Earth} Crossing Modalities},
author={Xiong, Zhitong and Wang, Yi and Zhang, Fahong and Stewart, Adam J and Hanna, Jo{\"e}lle and Borth, Damian and Papoutsis, Ioannis and Le Saux, Bertrand and Camps-Valls, Gustau and Zhu, Xiao Xiang},
journal={arXiv preprint arXiv:2403.15356},
year={2024}
}