vinid committed
Commit f1abd41
1 Parent(s): cf1218c

adding README.md and renaming readme.md to introduction.md

Files changed (3)
  1. README.md +34 -0
  2. app.py +1 -1
  3. readme.md → introduction.md +17 -9
README.md ADDED
@@ -0,0 +1,34 @@
+
+ ---
+ title: Clip Italian Demo
+ emoji: ⚡
+ colorFrom: gray
+ colorTo: pink
+ sdk: streamlit
+ app_file: app.py
+ pinned: false
+ ---
+
+ # Configuration
+
+ `title`: _string_
+ Display title for the Space
+
+ `emoji`: _string_
+ Space emoji (emoji-only character allowed)
+
+ `colorFrom`: _string_
+ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+
+ `colorTo`: _string_
+ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+
+ `sdk`: _string_
+ Can be either `gradio` or `streamlit`
+
+ `app_file`: _string_
+ Path to your main application file (which contains either `gradio` or `streamlit` Python code).
+ Path is relative to the root of the repository.
+
+ `pinned`: _boolean_
+ Whether the Space stays on top of your list.
app.py CHANGED
@@ -108,5 +108,5 @@ if query:
  
      st.image(image_paths)
  
- intro_markdown = read_markdown_file("readme.md")
+ intro_markdown = read_markdown_file("introduction.md")
      st.markdown(intro_markdown, unsafe_allow_html=True)
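The hunk above only changes the filename passed to `read_markdown_file`; the helper itself is defined elsewhere in app.py and is not shown here. A minimal sketch of what such a helper typically looks like, as an assumption for illustration rather than the repository's actual implementation:

```python
from pathlib import Path

import streamlit as st


def read_markdown_file(markdown_file: str) -> str:
    # Read a markdown file (path relative to the repository root) as UTF-8 text.
    return Path(markdown_file).read_text(encoding="utf-8")


# Render the introduction below the search results, allowing the raw HTML
# (e.g. the <img> tag) embedded in introduction.md.
intro_markdown = read_markdown_file("introduction.md")
st.markdown(intro_markdown, unsafe_allow_html=True)
```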
readme.md → introduction.md RENAMED
@@ -1,6 +1,8 @@
  # Italian CLIP
  
- With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples.
+ With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model
+ is built upon the [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI
+ [vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).
  
  In building this project we kept in mind the following principles:
  
@@ -32,12 +34,12 @@ We considered three main sources of data:
  [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions described in the paper as they are
  the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
  However, this kind of text, without more information, is not useful to learn a good mapping between images and captions.
- On the other hand, this text is written in Italian and it is of good quality.
- To prevent polluting the data with captions that are not meaningful, we used *POS tagging*
- on the text and removed all the captions that were composed for the 80% or more by PROPN. This is a simple solution that allowed us to retain much
+ On the other hand, this text is written in Italian and it is of good quality. We cannot simply remove short captions, as some of them
+ are still good (e.g., "running dog"). Thus, to prevent polluting the data with captions that are not meaningful, we used *POS tagging*
+ on the text and removed all the captions composed of 80% or more PROPN tokens (around 10% of the data). This is a simple solution that allowed us to retain much
  of the dataset, without introducing noise.
  
- Example: ....
+ Captions such as *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.
  
  + [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original
  MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
@@ -60,11 +62,16 @@ training pipeline: the optimizer and the training with frozen components.
  
  ### Optimizer
  
- The standard AdamW didn't seem enough to train the model...
- 
+ The standard AdamW didn't seem enough to train the model, so we opted for a different optimization strategy: we eventually used AdaBelief with AGC and Cosine Annealing.
+ Our implementation is available online [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/run_hybrid_clip.py#L667).
  
  ### Backbone Freezing
  
+ The ViT used by OpenAI was already trained on 400 million images, and it is probably the component of our architecture that required the least training.
+ The same is true for the BERT model we use. Thus, we decided to run a first training pass with the backbones of our architecture completely frozen, to allow
+ the deeper layers to adapt to the new setting. Eventually, we ran a second training pass, fine-tuning all the components. This technique allowed us to
+ reach a much better validation loss.
+ 
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="200"/>
  
  # Scientific Validity
@@ -107,7 +114,8 @@ on 400million images (and some of them probably were from MSCOCO).
  
  ### Zero-shot image classification
  
- This experiment replicates the original one run by OpenAI on zero-shot image classification.
+ This experiment replicates the original one run by OpenAI on zero-shot image classification, on ImageNet. To do this, we used DeepL to
+ translate the ImageNet image labels into Italian. We evaluate the models by computing accuracy.
  
  
  | Accuracy | CLIP-Italian | mCLIP |
@@ -121,7 +129,7 @@ This experiment replicates the original one run by OpenAI on zero-shot image cla
  
  Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
  we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
- paper (see, [Radford et al., 2021](https://arxiv.org/abs/2103.00020)), considering that our results are in line with those obtained by mCLIP we think that
+ paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). However, considering that our results are in line with those obtained by mCLIP, we think that
  the translated image labels might have had an impact on the final scores.
  
  ## Qualitative Evaluation
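The sections of introduction.md touched above describe several techniques in prose; the sketches below illustrate them under stated assumptions and are not the project's actual code. First, the WIT caption filter: drop captions whose tokens are 80% or more proper nouns (PROPN). A minimal version with spaCy, assuming the small Italian pipeline and assuming punctuation is excluded from the token count:

```python
import spacy

# Assumes the small Italian pipeline is installed:
#   python -m spacy download it_core_news_sm
nlp = spacy.load("it_core_news_sm")


def keep_caption(caption: str, max_propn_ratio: float = 0.8) -> bool:
    """Keep a caption unless 80% or more of its tokens are proper nouns."""
    doc = nlp(caption)
    tokens = [tok for tok in doc if not tok.is_punct and not tok.is_space]
    if not tokens:
        return False
    propn_ratio = sum(tok.pos_ == "PROPN" for tok in tokens) / len(tokens)
    return propn_ratio < max_propn_ratio


captions = ["Un cane che corre sulla spiaggia", "Anna Maria Mozzoni", "Joey Ramone Place"]
print([c for c in captions if keep_caption(c)])  # name-only captions are dropped
```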
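The Optimizer paragraph names AdaBelief, AGC, and Cosine Annealing and links to the authoritative implementation in run_hybrid_clip.py. As a rough illustration only, assuming AGC refers to Adaptive Gradient Clipping, the combination could be wired up with optax as follows (all hyperparameter values are placeholders):

```python
import jax.numpy as jnp
import optax

# Placeholder schedule lengths and learning rate, not the project's values.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1e-3,
    warmup_steps=500,
    decay_steps=10_000,
    end_value=0.0,
)

# Adaptive Gradient Clipping (AGC) applied before the AdaBelief update rule.
optimizer = optax.chain(
    optax.adaptive_grad_clip(clipping=0.01),
    optax.adabelief(learning_rate=schedule),
)

# The optimizer state is initialised from the model's parameter pytree.
params = {"w": jnp.zeros((2, 2))}
opt_state = optimizer.init(params)
```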
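The Backbone Freezing paragraph describes a two-stage schedule: first train with both backbones frozen, then fine-tune everything. One way to express the frozen stage in optax is `multi_transform`, which zeroes the updates for the frozen subtrees; the parameter names below are invented for illustration and do not match the project's parameter tree:

```python
import jax.numpy as jnp
import optax

# Toy parameter tree standing in for the text/vision backbones and projection heads.
params = {
    "text_backbone": {"w": jnp.ones((4, 4))},
    "vision_backbone": {"w": jnp.ones((4, 4))},
    "text_projection": {"w": jnp.ones((4, 2))},
    "vision_projection": {"w": jnp.ones((4, 2))},
}

# Label every leaf as "frozen" (backbones) or "trainable" (projection heads).
labels = {
    "text_backbone": {"w": "frozen"},
    "vision_backbone": {"w": "frozen"},
    "text_projection": {"w": "trainable"},
    "vision_projection": {"w": "trainable"},
}

# Stage 1: backbone updates are set to zero, only the heads are trained.
stage1 = optax.multi_transform(
    {"trainable": optax.adabelief(1e-3), "frozen": optax.set_to_zero()},
    labels,
)
opt_state = stage1.init(params)

# Stage 2: unfreeze everything and fine-tune all components (lower learning rate).
stage2 = optax.adabelief(1e-5)
```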
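Finally, the zero-shot evaluation boils down to scoring each image against every translated ImageNet label and taking the best match. A NumPy sketch of that scoring step, assuming image and label embeddings have already been produced by the two encoders:

```python
import numpy as np


def zero_shot_predict(image_embs: np.ndarray, label_embs: np.ndarray) -> np.ndarray:
    """For each image, return the index of the most similar (translated) label."""
    # L2-normalise so the dot product equals cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=-1, keepdims=True)
    return (image_embs @ label_embs.T).argmax(axis=-1)  # (n_images,)


# Toy example with random embeddings; accuracy is the fraction of predictions
# that match the gold class indices.
rng = np.random.default_rng(0)
preds = zero_shot_predict(rng.normal(size=(8, 512)), rng.normal(size=(1000, 512)))
gold = rng.integers(0, 1000, size=8)
print((preds == gold).mean())
```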