Tonic
/

video-swin-transformer

	---
	library_name: tf-keras
	license: other
	---
	# Collection shoaib6174/video_swin_transformer/1

	Collection of Video Swin Transformers feature extractor models.


	<!-- task: video-feature-extraction -->

	## Overview

	This collection contains several Video Swin Transformer [1] models. The original model weights come from [2]. They were ported to Keras models
	(`tf.keras.Model`) and then serialized as TensorFlow SavedModels. The porting steps are available in [3].


	## About the models

	These models can be used directly to extract features from videos. They are accompanied by
	Colab notebooks with fine-tuning steps for action recognition and video classification.

	The table below provides a performance summary:

	| model_name | pre-train dataset | fine-tune dataset | acc@1 (%) | acc@5 (%) |
	|:----------------------------------------------:|:-------------------:|:---------------------:|:----------:|:----------:|
	| swin_tiny_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400(1k) | 78.8 | 93.6 |
	| swin_small_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400(1k) | 80.6 | 94.5 |
	| swin_base_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400(1k) | 80.6 | 94.6 |
	| swin_base_patch244_window877_kinetics400_22k | ImageNet-22K | Kinetics 400(1k) | 82.7 | 95.5 |
	| swin_base_patch244_window877_kinetics600_22k | ImageNet-22K | Kinetics 600(1k) | 84.0 | 96.5 |
	| swin_base_patch244_window1677_sthv2 | Kinetics 400 | Something-Something V2 | 69.6 | 92.7 |


	These scores for all the models are taken from [2].



	### Video Swin Transformer feature extractor models

	* [swin_tiny_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_tiny_patch244_window877_kinetics400_1k)
	* [swin_small_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_small_patch244_window877_kinetics400_1k)
	* [swin_base_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics400_1k)
	* [swin_base_patch244_window877_kinetics400_22k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics400_22k)
	* [swin_base_patch244_window877_kinetics600_22k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics600_22k)
	* [swin_base_patch244_window1677_sthv2](https://tfhub.dev/shoaib6174/swin_base_patch244_window1677_sthv2)
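
The model names above appear to follow a common pattern: backbone size, patch size, window size, fine-tuning dataset, and an optional `1k`/`22k` suffix that matches the ImageNet pre-training set in the performance table. The sketch below decodes a name under that reading. It assumes `patch244` means a `(2, 4, 4)` patch and `window877` / `window1677` mean `(8, 7, 7)` / `(16, 7, 7)` windows in `(frames, height, width)` order; `SwinVariant` and `parse_model_name` are hypothetical helpers written for illustration and are not part of the released models.

```python
import re
from typing import NamedTuple, Optional, Tuple


class SwinVariant(NamedTuple):
    size: str                     # "tiny", "small", or "base"
    patch: Tuple[int, int, int]   # assumed (frames, height, width) patch size
    window: Tuple[int, int, int]  # assumed (frames, height, width) window size
    dataset: str                  # fine-tuning dataset tag, e.g. "kinetics400"
    pretrain: Optional[str]       # "1k" / "22k" suffix, or None when absent


_NAME = re.compile(
    r"^swin_(?P<size>tiny|small|base)"
    r"_patch(?P<patch>\d{3})"
    r"_window(?P<window>\d{3,4})"
    r"_(?P<dataset>[a-z0-9]+)"
    r"(?:_(?P<pretrain>\d+k))?$"
)


def _split_dims(digits: str) -> Tuple[int, int, int]:
    # Assumption: the last two digits are height and width,
    # and whatever precedes them is the temporal size.
    return int(digits[:-2]), int(digits[-2]), int(digits[-1])


def parse_model_name(name: str) -> SwinVariant:
    """Split a model name from the list above into its components."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    return SwinVariant(
        size=match["size"],
        patch=_split_dims(match["patch"]),
        window=_split_dims(match["window"]),
        dataset=match["dataset"],
        pretrain=match["pretrain"],
    )


variant = parse_model_name("swin_base_patch244_window1677_sthv2")
print(variant)
```

A helper like this can be handy when selecting a variant programmatically, for example to filter the list by backbone size or fine-tuning dataset.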



	## Notes

	The input shape for these models is `[None, 3, 32, 224, 224]`, representing `[batch_size, channels, frames, height, width]`. To create models with a different input shape, use [this notebook](https://colab.research.google.com/drive/1sZIM7_OV1__CFV-WSQguOOZ8VyOsDaGM).
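
Video decoders commonly return frames in `[frames, height, width, channels]` order, so a clip usually has to be rearranged before it matches the channels-first layout described above. The following is a minimal sketch of that step using NumPy. It assumes the frames are already resized to 224x224 RGB, and it picks 32 evenly spaced frames; the sampling strategy, the `to_model_input` name, and the absence of pixel normalization are choices made for illustration, not the preprocessing used by the original models.

```python
import numpy as np


def to_model_input(frames: np.ndarray, num_frames: int = 32) -> np.ndarray:
    """Convert a [T, H, W, 3] clip into a [1, 3, num_frames, H, W] float32 batch."""
    if frames.ndim != 4 or frames.shape[-1] != 3:
        raise ValueError("expected frames with shape [T, H, W, 3]")
    total = frames.shape[0]
    if total == 0:
        raise ValueError("clip contains no frames")
    # Evenly spaced frame indices; frames repeat if the clip is shorter than num_frames.
    idx = np.linspace(0, total - 1, num_frames).round().astype(int)
    clip = frames[idx].astype(np.float32)
    clip = np.transpose(clip, (3, 0, 1, 2))  # [T, H, W, C] -> [C, T, H, W]
    return clip[np.newaxis]                  # add batch axis -> [1, C, T, H, W]


# Stand-in for a decoded 40-frame video that is already resized to 224x224.
video = np.zeros((40, 224, 224, 3), dtype=np.uint8)
batch = to_model_input(video)
print(batch.shape)  # (1, 3, 32, 224, 224)
```

Any resizing, cropping, or normalization should follow the fine-tuning notebooks that accompany the models; this sketch only covers frame sampling and axis order.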

	## References
	[1] [Video Swin Transformer, Liu et al.](https://arxiv.org/abs/2106.13230)
	[2] [Video Swin Transformer GitHub](https://github.com/SwinTransformer/Video-Swin-Transformer)
	[3] [GSOC-22-Video-Swin-Transformers GitHub](https://github.com/shoaib6174/GSOC-22-Video-Swin-Transformers)

	## Acknowledgements
	* [Google Summer of Code 2022](https://summerofcode.withgoogle.com/)
	* [Luiz Gustavo Martins](https://www.linkedin.com/in/luiz-gustavo-martins-64ab5891/)
	* [Sayak Paul](https://www.linkedin.com/in/sayak-paul/)