innat committed on
Commit 95df5fb
1 Parent(s): 4b66528

Update README.md

Files changed (1)
  1. README.md +39 -3
README.md CHANGED
@@ -9,11 +9,47 @@ tags:
  - vision
  ---

+ # [Video Swin Transformer : VideoSwin](https://github.com/innat/VideoSwin)
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/622dcfbee27c88667db09411/RYHUktzy8caODUOTrZ4LU.png)
 
 
+
  | Paper | Colab | HF Space | HF Hub |
  | :--: | :--: | :---: | :---: |
  | [![arXiv](https://img.shields.io/badge/arXiv-2106.13230-darkred)](https://arxiv.org/abs/2106.13230) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Q7A700MEI10UomikqjQJANWyFZktJCT-?usp=sharing) | [![HuggingFace badge](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow.svg)](https://huggingface.co/spaces/innat/VideoSwin) | [![HuggingFace badge](https://img.shields.io/badge/🤗%20Hugging%20Face-Hub-yellow.svg)](https://huggingface.co/innat/videoswin) |
+
+
+ VideoSwin is a pure transformer-based video modeling architecture that attains top accuracy on the major video recognition benchmarks. The authors advocate an inductive bias of locality in video transformers, which leads to a better speed-accuracy trade-off than previous approaches that compute self-attention globally, even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the [**Swin Transformer**](https://arxiv.org/abs/2103.14030), designed for the image domain, while continuing to leverage the power of pre-trained image models.
+
+
+ - GitHub: https://github.com/innat/VideoSwin
+
+ This is an unofficial `Keras` implementation of [Video Swin Transformer](https://arxiv.org/abs/2106.13230). The official `PyTorch` implementation is [here](https://github.com/SwinTransformer/Video-Swin-Transformer), based on [mmaction2](https://github.com/open-mmlab/mmaction2).
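+
+ A minimal usage sketch is shown below. It is not taken from the official docs: it assumes the Hub repo `innat/videoswin` stores a TF/Keras checkpoint that `huggingface_hub.from_pretrained_keras` can restore, and that the model expects 32-frame, 224x224 RGB clips as in the official Kinetics-400 configs. See the Colab linked above for the supported loading steps.
+
+ ```python
+ # Hedged usage sketch, not the documented API of this repo.
+ # Assumptions: the checkpoint is a Keras/TF SavedModel on the Hub, and the
+ # classifier expects clips shaped (batch, frames, height, width, 3).
+ import numpy as np
+ from huggingface_hub import from_pretrained_keras
+
+ model = from_pretrained_keras("innat/videoswin")  # assumed repo id / format
+
+ # Dummy clip: 32 frames of 224x224 RGB, values in [0, 1].
+ clip = np.random.rand(1, 32, 224, 224, 3).astype("float32")
+ logits = model.predict(clip)
+ print(logits.shape)  # e.g. (1, 400) for a Kinetics-400 checkpoint
+ ```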
+
+
+ ## Model Zoo
+
+ The 3D Video Swin checkpoints are listed in [`MODEL_ZOO.md`](https://github.com/innat/VideoSwin/blob/main/MODEL_ZOO.md). The following are some highlights.
+
+ ### Kinetics 400
+
+ During training, the Video Swin models are initialized with the pretrained weights of the image Swin models. In the tables below, `IN` refers to **ImageNet**.
+
+ | Backbone | Pretrain | Top-1 | Top-5 | #params | FLOPs | config |
+ | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Swin-T | IN-1K | 78.8 | 93.6 | 28M | ? | [swin-t](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py) |
+ | Swin-S | IN-1K | 80.6 | 94.5 | 50M | ? | [swin-s](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_small_patch244_window877_kinetics400_1k.py) |
+ | Swin-B | IN-1K | 80.6 | 94.6 | 88M | ? | [swin-b](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_base_patch244_window877_kinetics400_1k.py) |
+ | Swin-B | IN-22K | 82.7 | 95.5 | 88M | ? | [swin-b](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_base_patch244_window877_kinetics400_22k.py) |
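+
+ The parameter counts in the table can be cross-checked on a loaded Keras model. The snippet below is only an illustration; the loading call and repo id mirror the hedged sketch earlier and are assumptions rather than the documented API.
+
+ ```python
+ # Illustrative cross-check of the "#params" column (assumed loading path).
+ from huggingface_hub import from_pretrained_keras
+
+ model = from_pretrained_keras("innat/videoswin")  # assumed repo id / format
+ print(f"{model.count_params() / 1e6:.1f}M parameters")  # e.g. ~28M for Swin-T
+ ```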
+
+ ### Kinetics 600
+
+ | Backbone | Pretrain | Top-1 | Top-5 | #params | FLOPs | config |
+ | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Swin-B | IN-22K | 84.0 | 96.5 | 88M | ? | [swin-b](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_base_patch244_window877_kinetics600_22k.py) |
+
+ ### Something-Something V2
+
+ | Backbone | Pretrain | Top-1 | Top-5 | #params | FLOPs | config |
+ | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Swin-B | Kinetics 400 | 69.6 | 92.7 | 89M | ? | [swin-b](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_base_patch244_window1677_sthv2.py) |