Demonstration of Cross-modal Retrieval (FLIP-based model)

FLIP (Facial Language Image Pretraining)

This repository is the official implementation of FaceCaption-15M.

Updates:

[24/07/20] A usage demo for FLIP has been released: OpenFace-CQUPT/FLIP-demo

[24/07/17] The FLIP model has been released: OpenFace-CQUPT/FLIP

Overview of the FLIP architecture.

Fig. 1: (a) Components shown in the same color share parameters; “12x” denotes a 12-layer transformer module. (b), (c), and (d) show the FLIP-based model applied to text-image retrieval, facial attribute prediction, and sketch less facial image retrieval (SLFIR), respectively.

Training

Coming soon. (The training code is only meaningful once the datasets have been published.)

python pretrain.py > log.log

Pre-trained Models

We provide pretrained model weights:
FLIP Base: click here
FLIP Large: coming soon

Datasets

Download the FaceCaption-15M dataset from here.

Results

Task 1: Text-Image Retrieval

Table 1: Comparison with other classical pretrained models. All pretrained model backbones are frozen, with only the linear layer being fine-tuned. † represents the model pretrained on the LAION-Face [86] dataset; * represents the model pretrained on the FaceCaption dataset constructed without using LLM text generation.
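
The frozen-backbone protocol described in the Table 1 caption is commonly called linear probing. The following is a minimal, dependency-free sketch of that idea, not the authors' training code: `frozen_backbone` is a hypothetical stand-in for a pretrained encoder, and the four toy samples, the binary labels, and the hyperparameters are illustrative assumptions.

```python
import math


def frozen_backbone(x):
    # Hypothetical stand-in for a frozen pretrained encoder:
    # a fixed feature map with no trainable parameters.
    return [x[0], x[1], x[0] * x[1]]


def train_linear_head(samples, labels, epochs=5000, lr=0.2):
    # Features are computed once, because the backbone is never updated.
    feats = [frozen_backbone(x) for x in samples]
    w = [0.0] * len(feats[0])  # trainable weights of the linear layer
    b = 0.0                    # trainable bias of the linear layer
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid probability
            g = p - y  # gradient of binary cross-entropy w.r.t. the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    f = frozen_backbone(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Toy data (assumption): XOR labels, separable only in the backbone's feature space.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]
w, b = train_linear_head(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

Only `w` and `b` change during training, which is what makes the comparison a test of the pretrained representation rather than of the fine-tuning budget.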

Task 2: Facial Attributes Prediction

Table 2: Comparison with other classical models. † represents the model pre-trained on the original LAION-Face dataset.

Task 3: Sketch Less Facial Image Retrieval

Table 3: Comparative results with different baseline methods. † represents the model pre-trained on the LAION-Face dataset.

Fig. 2: Demonstration of our FLIP-based model on the SLFIR task. Both methods retrieve the target face photo within the top-5 list from a partial sketch, but our FLIP-based model does so with fewer strokes than the baseline. The number at the bottom denotes the rank of the paired (true-match) photo at each stage.
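
To make the per-stage rank in the caption concrete, here is a small sketch of how such a rank could be computed. It assumes that sketches and photos are embedded in a shared space and compared by cosine similarity; the function names and the toy two-dimensional vectors are hypothetical and are not taken from the FLIP code or from any real data.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_of_match(query, gallery, true_index):
    # 1-based rank of the paired photo: one plus the number of
    # gallery photos that score strictly higher than the true match.
    target = cosine(query, gallery[true_index])
    return 1 + sum(1 for g in gallery if cosine(query, g) > target)


# Toy photo embeddings (assumption); the paired photo is at index 1.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Toy embeddings of a partial sketch after each drawing stage (assumption).
stages = [[1.0, 0.2], [0.4, 0.8], [0.1, 1.0]]

ranks = [rank_of_match(q, gallery, true_index=1) for q in stages]
in_top5 = [r <= 5 for r in ranks]
print(ranks)  # [3, 2, 1]
```

A method that needs fewer strokes is one whose rank drops into the top-5 list at an earlier stage.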

Contacts

mailto: [email protected] or [email protected]

Citation

@misc{dai202415mmultimodalfacialimagetext,
      title={15M Multimodal Facial Image-Text Dataset}, 
      author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
      year={2024},
      eprint={2407.08515},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.08515}, 
}
