Apply for community grant: Personal project

#1
by watchtowerss - opened
Visual Intelligence & Perception Lab org

Track Anything: A zero-shot application of Segment Anything in the video domain.
The Zheng Feng team at the Southern University of Science and Technology recently introduced Track Anything, an interactive tool built on the Segment Anything Model released by Meta. This derivative work extends SAM to the video domain: with a few simple mouse clicks, it can track any target in any video at the pixel level, making target tracking interactive, flexible, and easy to use.

The Segment Anything Model (SAM) is a new image segmentation foundation model recently proposed by Meta. It can generate masks for any object in any image or video, including objects and images it has never seen. Upon its release, the model quickly caused a stir in the computer vision field and inspired a wave of derivative works in the community. However, we found that current applications of Segment Anything are concentrated mainly in the image domain, while its use in the video domain remains largely unexplored; video target tracking and segmentation in particular still pose a significant challenge. At the same time, existing tracking models have clear limitations: scene switching is common in real videos, yet existing trackers can only localize targets accurately within a single scene, and they require accurate initialization in the form of human-provided bounding boxes or precise masks, which greatly restricts their usability.
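For context, prompt-based segmentation with SAM looks roughly like the minimal sketch below, using Meta's publicly released segment-anything package. The checkpoint path, frame filename, and click coordinates are placeholders, not values from this project.

```python
# Minimal sketch of click-prompted segmentation with segment-anything
# (pip install segment-anything). Paths and coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM backbone; "vit_h" is the largest released variant.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Read one video frame (RGB) and register it with the predictor.
frame = cv2.cvtColor(cv2.imread("frame_000.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

# One positive click (label 1) and one negative click (label 0).
point_coords = np.array([[320, 240], [50, 60]])
point_labels = np.array([1, 0])

# SAM returns candidate masks for the prompted object; keep the best-scoring one.
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]
```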

Our project builds the Track Anything tool by interactively combining SAM with video object segmentation (VOS) models, with the user in the loop. It offers strong video target tracking and a user-friendly interface: with just a few clicks, users can track one or more target regions of interest and flexibly adjust the target objects as needed, and the tool also provides error correction and video editing functions. The technical process is as follows (a schematic code sketch follows the list):

1. Based on SAM, users determine the spatial area of the target object in a certain frame of the video by selecting positive and negative sample points.
2. The user-defined area is used as a template to initialize the tracking model. During tracking, dense associations are established between each video frame and the template area to transfer the user-selected information throughout the entire video, thereby achieving target tracking.
3. During tracking, users can pause at any time and, using the same positive/negative sample selection method, track additional targets or correct the current tracking results.
4. Based on the tracking results, users can apply target erasure or video repair functions to edit specific areas of the video.
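The sketch below outlines this loop schematically. It is not the project's actual code: segment_with_clicks stands in for the SAM step shown earlier, and SimpleTemplateTracker is a toy stand-in for the VOS model that Track Anything pairs with SAM (an XMem-style tracker in practice); the inpainting/erasure step is only indicated in a comment.

```python
# Schematic sketch of the interactive tracking loop; all components are
# simplified stand-ins, not the Track Anything implementation.
import numpy as np

def segment_with_clicks(frame, point_coords, point_labels):
    """Placeholder for the SAM click-to-mask step: returns a boolean mask."""
    mask = np.zeros(frame.shape[:2], dtype=bool)
    for (x, y), label in zip(point_coords, point_labels):
        if label == 1:  # positive click: mark a small region around it
            mask[max(0, y - 5):y + 5, max(0, x - 5):x + 5] = True
    return mask

class SimpleTemplateTracker:
    """Toy tracker: stores the template mask and 'propagates' it unchanged.
    A real VOS model would densely match each new frame against the template."""
    def initialize(self, frame, mask):
        self.template_mask = mask
    def track(self, frame):
        return self.template_mask

def track_video(frames, point_coords, point_labels):
    # Step 1: user clicks on the first frame define the target region.
    template_mask = segment_with_clicks(frames[0], point_coords, point_labels)
    # Step 2: initialize the tracker with that region and propagate it through the video.
    tracker = SimpleTemplateTracker()
    tracker.initialize(frames[0], template_mask)
    masks = [template_mask] + [tracker.track(f) for f in frames[1:]]
    # Steps 3-4: interactive correction would re-run segment_with_clicks at a paused
    # frame and re-seed the tracker; the per-frame masks can then drive erasure/inpainting.
    return masks

# Usage on dummy data: three 480x640 RGB frames, one positive and one negative click.
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
masks = track_video(frames, point_coords=[(320, 240), (50, 60)], point_labels=[1, 0])
print(len(masks), masks[0].shape)
```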


Visual Intelligence & Perception Lab org

For the grant application.

Hi @watchtowerss,
Did you request a GPU grant from Hugging Face to power this Space? Maybe it was granted, but only for a limited period.
Is there a way to apply again? :)
Best,
Mike
