---
license: other
library_name: keras
---
# Collection shoaib6174/video_swin_transformer/1

Collection of Video Swin Transformer feature extractor models.

<!-- task: video-feature-extraction -->

## Overview

This collection contains several Video Swin Transformer [1] models. The original model weights come from [2]. They were ported to Keras models
(`tf.keras.Model`) and then serialized as TensorFlow SavedModels. The porting steps are available in [3].

## About the models

These models can be used directly to extract features from videos. They are accompanied by
Colab notebooks with fine-tuning steps for action recognition and video classification.

The table below provides a performance summary:

| model_name | pre-train dataset | fine-tune dataset | acc@1 (%) | acc@5 (%) |
|:----------------------------------------------:|:-------------------:|:---------------------:|:----------:|----------:|
| swin_tiny_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400 (1k) | 78.8 | 93.6 |
| swin_small_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400 (1k) | 80.6 | 94.5 |
| swin_base_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400 (1k) | 80.6 | 96.6 |
| swin_base_patch244_window877_kinetics400_22k | ImageNet-22K | Kinetics 400 (1k) | 82.7 | 95.5 |
| swin_base_patch244_window877_kinetics600_22k | ImageNet-22K | Kinetics 600 (1k) | 84.0 | 96.5 |
| swin_base_patch244_window1677_sthv2 | Kinetics 400 | Something-Something V2 | 69.6 | 92.7 |

The scores for all models are taken from [2].

### Video Swin Transformer feature extractor models

* [swin_tiny_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_tiny_patch244_window877_kinetics400_1k)
* [swin_small_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_small_patch244_window877_kinetics400_1k)
* [swin_base_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics400_1k)
* [swin_base_patch244_window877_kinetics400_22k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics400_22k)
* [swin_base_patch244_window877_kinetics600_22k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics600_22k)
* [swin_base_patch244_window1677_sthv2](https://tfhub.dev/shoaib6174/swin_base_patch244_window1677_sthv2)
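
As a rough illustration, one of the models listed above could be loaded as a Keras layer through `tensorflow_hub`. The helper names `handle_for` and `load_feature_extractor` are hypothetical, and the sketch assumes that the `tensorflow_hub` package is installed and that the handles still resolve; it has not been checked against the published SavedModels.

```python
# Model names and URL prefix are copied from the list above.
TFHUB_PREFIX = "https://tfhub.dev/shoaib6174/"

MODEL_NAMES = (
    "swin_tiny_patch244_window877_kinetics400_1k",
    "swin_small_patch244_window877_kinetics400_1k",
    "swin_base_patch244_window877_kinetics400_1k",
    "swin_base_patch244_window877_kinetics400_22k",
    "swin_base_patch244_window877_kinetics600_22k",
    "swin_base_patch244_window1677_sthv2",
)


def handle_for(model_name: str) -> str:
    """Return the TF Hub handle for a model name from this collection (hypothetical helper)."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model name: {model_name!r}")
    return TFHUB_PREFIX + model_name


def load_feature_extractor(model_name: str):
    """Load the model as a frozen Keras layer (assumes tensorflow_hub is installed)."""
    # Imported lazily so that building a handle does not require TensorFlow.
    import tensorflow_hub as hub

    return hub.KerasLayer(handle_for(model_name), trainable=False)
```

Calling `load_feature_extractor("swin_tiny_patch244_window877_kinetics400_1k")` downloads the model on first use, so it needs network access.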

## Notes

The input shape for these models is `[None, 3, 32, 224, 224]`, representing `[batch_size, channels, frames, height, width]`. To create models with a different input shape, use [this notebook](https://colab.research.google.com/drive/1sZIM7_OV1__CFV-WSQguOOZ8VyOsDaGM).
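
The layout above is channels-first, while many video loaders return frames as `(frames, height, width, channels)`. The following is a minimal NumPy sketch of rearranging such a clip into the documented shape. The helper name `to_model_input`, the uniform frame sampling, and the center crop are illustrative assumptions, not steps taken from the porting code; no pixel normalization is applied because this README does not specify one.

```python
import numpy as np


def to_model_input(clip: np.ndarray, num_frames: int = 32, size: int = 224) -> np.ndarray:
    """Rearrange a (frames, height, width, 3) clip into (1, 3, num_frames, size, size).

    Hypothetical helper: sampling and cropping choices are assumptions.
    """
    if clip.ndim != 4 or clip.shape[-1] != 3:
        raise ValueError("expected a clip shaped (frames, height, width, 3)")
    total, height, width, _ = clip.shape
    if total < num_frames or height < size or width < size:
        raise ValueError("clip is smaller than the requested input shape")

    # Pick num_frames indices spread uniformly over the clip.
    indices = np.linspace(0, total - 1, num_frames).round().astype(int)
    frames = clip[indices]

    # Center-crop each frame to size x size.
    top = (height - size) // 2
    left = (width - size) // 2
    frames = frames[:, top:top + size, left:left + size, :]

    # (frames, height, width, channels) -> (channels, frames, height, width)
    frames = np.transpose(frames, (3, 0, 1, 2)).astype(np.float32)

    # Add the leading batch dimension.
    return frames[np.newaxis]


# Usage with a dummy 40-frame clip of 240x320 RGB frames.
dummy_clip = np.zeros((40, 240, 320, 3), dtype=np.uint8)
batch = to_model_input(dummy_clip)
print(batch.shape)
```

The returned array has one batch entry; several clips can be combined with `np.concatenate(..., axis=0)` before being passed to a model.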

## References
[1] [Video Swin Transformer, Liu et al.](https://arxiv.org/abs/2106.13230)
[2] [Video Swin Transformer GitHub](https://github.com/SwinTransformer/Video-Swin-Transformer)
[3] [GSOC-22-Video-Swin-Transformers GitHub](https://github.com/shoaib6174/GSOC-22-Video-Swin-Transformers)

## Acknowledgements
* [Google Summer of Code 2022](https://summerofcode.withgoogle.com/)
* [Luiz Gustavo Martins](https://www.linkedin.com/in/luiz-gustavo-martins-64ab5891/)
* [Sayak Paul](https://www.linkedin.com/in/sayak-paul/)