---
library_name: tf-keras
license: other
---

# Collection shoaib6174/video_swin_transformer/1

Collection of Video Swin Transformer feature extractor models.

<!-- task: video-feature-extraction -->

## Overview

This collection contains several Video Swin Transformer [1] models. The original model weights are provided by [2]. They were ported to Keras models (`tf.keras.Model`) and then serialized as TensorFlow SavedModels. The porting steps are available in [3].

## About the models

These models can be used directly to extract features from videos. They are accompanied by Colab notebooks that show how to fine-tune them for action recognition and video classification.

The table below summarizes their performance:

| Model name | Pre-train dataset | Fine-tune dataset | acc@1 (%) | acc@5 (%) |
|:----------------------------------------------|:------------:|:----------------------:|----------:|----------:|
| swin_tiny_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400 (1k) | 78.8 | 93.6 |
| swin_small_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400 (1k) | 80.6 | 94.5 |
| swin_base_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400 (1k) | 80.6 | 96.6 |
| swin_base_patch244_window877_kinetics400_22k | ImageNet-22K | Kinetics 400 (1k) | 82.7 | 95.5 |
| swin_base_patch244_window877_kinetics600_22k | ImageNet-22K | Kinetics 600 (1k) | 84.0 | 96.5 |
| swin_base_patch244_window1677_sthv2 | Kinetics 400 | Something-Something V2 | 69.6 | 92.7 |

All scores are taken from [2].
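
The model names appear to encode the 3D patch size and attention-window size as (time, height, width) digits. Under that reading, `patch244` means a 2×4×4 patch and `window877` means an 8×7×7 window; in `window1677`, the temporal window is 16. This reading is consistent with the configs in [2], but it is inferred from the names. The sketch below decodes a name under that assumed convention. It then computes the token grid that a given input produces, rounding up on the assumption that inputs are padded to a multiple of the patch size. `SwinConfig`, `parse_model_name`, and `token_grid` are hypothetical helpers and are not part of the released models.

```python
import re
from typing import NamedTuple, Tuple


class SwinConfig(NamedTuple):
    """Settings decoded from a model name (hypothetical helper)."""
    variant: str                     # "tiny", "small" or "base"
    patch_size: Tuple[int, int, int]   # (time, height, width)
    window_size: Tuple[int, int, int]  # (time, height, width)
    dataset: str                     # trailing suffix, e.g. "kinetics400_1k"


# Assumed convention: swin_<variant>_patch<t><h><w>_window<t><h><w>_<dataset>,
# where the temporal window may have two digits (e.g. "window1677").
_NAME_RE = re.compile(
    r"^swin_(?P<variant>[a-z]+)_patch(?P<patch>\d{3})"
    r"_window(?P<window>\d{3,4})_(?P<dataset>.+)$"
)


def parse_model_name(name: str) -> SwinConfig:
    m = _NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized model name: {name}")
    patch = tuple(int(c) for c in m["patch"])
    w = m["window"]
    # The last two digits are the spatial window; the rest is the temporal window.
    window = (int(w[:-2]), int(w[-2]), int(w[-1]))
    return SwinConfig(m["variant"], patch, window, m["dataset"])


def token_grid(input_thw, patch_size):
    """Patches along (time, height, width), using ceiling division for padding."""
    return tuple(-(-n // p) for n, p in zip(input_thw, patch_size))
```

For the default 32×224×224 input, `token_grid((32, 224, 224), (2, 4, 4))` gives `(16, 56, 56)`.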

### Video Swin Transformer feature extractor models

* [swin_tiny_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_tiny_patch244_window877_kinetics400_1k)
* [swin_small_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_small_patch244_window877_kinetics400_1k)
* [swin_base_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics400_1k)
* [swin_base_patch244_window877_kinetics400_22k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics400_22k)
* [swin_base_patch244_window877_kinetics600_22k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics600_22k)
* [swin_base_patch244_window1677_sthv2](https://tfhub.dev/shoaib6174/swin_base_patch244_window1677_sthv2)
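
One of the models above can be loaded with the minimal sketch below. It assumes that the SavedModels can be wrapped with `tensorflow_hub.KerasLayer` and that the tfhub.dev handles still resolve, possibly through a redirect. `HANDLES` and `load_feature_extractor` are hypothetical names. The handles are built from the same URL pattern as the links above. `tensorflow_hub` is imported inside the function so that the handle table can be built without it being installed.

```python
# Model names from the list above.
MODEL_NAMES = [
    "swin_tiny_patch244_window877_kinetics400_1k",
    "swin_small_patch244_window877_kinetics400_1k",
    "swin_base_patch244_window877_kinetics400_1k",
    "swin_base_patch244_window877_kinetics400_22k",
    "swin_base_patch244_window877_kinetics600_22k",
    "swin_base_patch244_window1677_sthv2",
]

HUB_ROOT = "https://tfhub.dev/shoaib6174/"

# Hypothetical lookup table: model name -> tfhub.dev handle.
HANDLES = {name: HUB_ROOT + name for name in MODEL_NAMES}


def load_feature_extractor(name, trainable=False):
    """Wrap one SavedModel from the collection as a Keras layer.

    Assumes tensorflow_hub is installed and the handle is reachable.
    Pass trainable=True to fine-tune the backbone weights.
    """
    # Validate the name before importing, so typos fail fast without TF installed.
    if name not in HANDLES:
        raise KeyError(f"unknown model: {name}")
    import tensorflow_hub as hub  # lazy import: building HANDLES does not need it

    return hub.KerasLayer(HANDLES[name], trainable=trainable)
```

The returned layer is meant to be called on a batch shaped as described in the Notes. The shape of the features it returns is not documented here, so inspect it before building a head on top.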

## Notes

The input shape for these models is `[None, 3, 32, 224, 224]`, representing `[batch_size, channels, frames, height, width]`. To create models with a different input shape, use [this notebook](https://colab.research.google.com/drive/1sZIM7_OV1__CFV-WSQguOOZ8VyOsDaGM).
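
The models expect channels-first input, so a clip decoded as `[frames, height, width, channels]` (the usual layout from video readers) has to be rearranged. The NumPy sketch below does three things. It samples 32 frames uniformly and normalizes them with the ImageNet mean and standard deviation (0-255 scale) used in the original configs of [2]. It then moves the channel axis in front of the frame axis and adds a batch axis. Uniform sampling is a simplification of the original frame sampling. Whether the ported models expect this exact normalization is an assumption. `to_model_input` is a hypothetical helper, and frames are assumed to be resized to 224×224 already.

```python
import numpy as np

# ImageNet statistics on the 0-255 scale (assumed to match the ported models).
MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
STD = np.array([58.395, 57.12, 57.375], dtype=np.float32)


def to_model_input(frames: np.ndarray, num_frames: int = 32) -> np.ndarray:
    """[T, H, W, 3] uint8 RGB clip -> [1, 3, num_frames, H, W] float32."""
    if frames.ndim != 4 or frames.shape[-1] != 3:
        raise ValueError("expected a clip shaped [frames, height, width, 3]")
    # Uniformly spaced frame indices; short clips repeat frames.
    idx = np.linspace(0, frames.shape[0] - 1, num_frames).round().astype(int)
    clip = frames[idx].astype(np.float32)
    clip = (clip - MEAN) / STD                 # broadcasts over the channel axis
    clip = np.transpose(clip, (3, 0, 1, 2))    # [T, H, W, C] -> [C, T, H, W]
    return clip[np.newaxis]                    # add batch axis -> [1, C, T, H, W]
```

The result can be stacked along the first axis to form a larger batch before passing it to a feature extractor.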

## References

[1] [Ze Liu et al., Video Swin Transformer](https://arxiv.org/abs/2106.13230)

[2] [Video Swin Transformer GitHub](https://github.com/SwinTransformer/Video-Swin-Transformer)

[3] [GSOC-22-Video-Swin-Transformers GitHub](https://github.com/shoaib6174/GSOC-22-Video-Swin-Transformers)

## Acknowledgements

* [Google Summer of Code 2022](https://summerofcode.withgoogle.com/)
* [Luiz Gustavo Martins](https://www.linkedin.com/in/luiz-gustavo-martins-64ab5891/)
* [Sayak Paul](https://www.linkedin.com/in/sayak-paul/)