MobileNet Baselines

Community Article Published July 26, 2024

Those who follow me know that I can't resist an opportunity to update an old baseline.

When the MobileNet-V4 paper came out I noted that they re-ran their MobileNet-V1 baseline to get a 74% ImageNet accuracy. The original models were around 71%. That's quite a jump.

Intrigued, I looked more closely at their recipe for the 'small' model. It used unusual optimizer hparams that brought the AdamW beta1 from the default 0.9 down to 0.6, taking it closer to RMSProp. Additionally, there was fairly high dropout and augmentation for a smaller model, but a very long epoch count (9600 ImageNet-1k epochs in their case).
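To get a feel for what lowering beta1 does, here is a minimal sketch in plain Python. It only models the bias-corrected first moment (the gradient EMA) that Adam-style optimizers keep, not the full AdamW update, and the helper names `first_moment_trace` and `steps_to_flip` are made up for illustration. It counts how many steps the averaged gradient needs to change sign after the raw gradient flips from +1 to -1.

```python
def first_moment_trace(grads, beta1):
    """Bias-corrected exponential moving average of gradients (Adam's first moment)."""
    m = 0.0
    trace = []
    for step, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        trace.append(m / (1.0 - beta1 ** step))  # bias correction
    return trace


def steps_to_flip(beta1, history=50, max_steps=100):
    """Steps until the averaged gradient turns negative after the raw gradient flips sign."""
    grads = [1.0] * history + [-1.0] * max_steps
    trace = first_moment_trace(grads, beta1)
    for i, m_hat in enumerate(trace[history:], start=1):
        if m_hat < 0.0:
            return i
    return None


if __name__ == "__main__":
    for beta1 in (0.9, 0.6):
        print(f"beta1={beta1}: averaged gradient flips after {steps_to_flip(beta1)} steps")
```

With beta1 = 0.6 the average forgets old gradients much faster than with 0.9. At beta1 = 0 there would be no momentum at all, which is the RMSProp-like end of the range.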

I set out to try these hparams myself in timm, initially by training a reproduction of MobileNet-V4-Small, where I successfully hit 73.8 at 2400 epochs (instead of 9600). I then took a crack at MobileNet-V1, as I'd never had that model in timm.

My MobileNet-V1 run just finished, 3600 ImageNet-1k epochs with a 75.4% top-1 accuracy on ImageNet at the 224x224 train resolution (76% at 256x256) -- no distillation, no additional data. The OOD dataset scores on ImageNet-V2, Sketch, etc. seem pretty solid, so it doesn't appear to be a gross overfit. Weights here: https://huggingface.co/timm/mobilenetv1_100.ra4_e3600_r224_in1k
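For anyone who wants to try the weights, a minimal loading sketch follows. It assumes `timm` and `torch` are installed and that the Hugging Face Hub is reachable. The wrapper name `load_for_inference` is just an illustrative helper, and the import is done lazily inside it so the snippet can be pasted into a script without pulling in timm until it is actually called.

```python
MODEL_NAME = "mobilenetv1_100.ra4_e3600_r224_in1k"  # model id taken from the weights URL above


def load_for_inference(model_name=MODEL_NAME):
    """Create the pretrained model in eval mode plus its matching eval transform."""
    import timm  # lazy import: only needed when the function is called

    model = timm.create_model(model_name, pretrained=True).eval()
    # Pull input size, interpolation, mean/std from the pretrained config
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)
    return model, transform
```

Calling `model, transform = load_for_inference()` downloads the checkpoint on first use. Applying `transform` to a PIL image and adding a batch dimension gives a tensor the model can consume.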

Comparing to some other MobileNets:

Original MobileNet-V1 1.0
- Weights: by Google, https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet
- Accuracy: 70.9%, Param: 4.2M, GMAC: 0.6
Original MobileNet-V2 1.0
- Weights: by Google, https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet
- Accuracy: 71.8%, Param: 3.5M, GMAC: 0.3
MobileNet-V2 1.0
- Weights: by me in timm, https://huggingface.co/timm/mobilenetv2_100.ra_in1k
- Accuracy: 73.0%, Param: 3.5M, GMAC: 0.3
MobileNet-V2 1.0 (MNV4 Paper) - Accuracy: 73.4%, Param: 3.5M, GMAC: 0.3
Original MobileNet-V4 Small (MNV4 Paper) - Accuracy: 73.8%, Param: 3.8M, GMAC: 0.2
MobileNet-V4 Small
- Weights: by me in timm, https://huggingface.co/timm/mobilenetv4_conv_small.e2400_r224_in1k
- Accuracy: 73.8%, Param: 3.8M, GMAC: 0.2
MobileNet-V1 1.0 (MNV4 Paper) - Accuracy: 74.0%, Param: 4.2M, GMAC: 0.6
MobileNet-V2 1.1 w/ depth scaling
- Weights: by me in timm, https://huggingface.co/timm/mobilenetv2_110d.ra_in1k
- Accuracy: 75.0%, Param: 4.5M, GMAC: 0.4
MobileNet-V1 1.0
- Weights: This recipe, https://huggingface.co/timm/mobilenetv1_100.ra4_e3600_r224_in1k
- Accuracy: 75.4%, Param: 4.2M, GMAC: 0.6
MobileNet-V3 Large 1.0
- Weights: by Google, https://huggingface.co/timm/tf_mobilenetv3_large_100.in1k
- Accuracy: 75.5%, Param: 5.5M, GMAC: 0.2
MobileNet-V3 Large 1.0
- Weights: by me in timm, https://huggingface.co/timm/mobilenetv3_large_100.ra_in1k
- Accuracy: 75.8%, Param: 5.5M, GMAC: 0.2
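To make the gaps easier to read off, here is a small sketch that puts the numbers from the list above into a Python structure and computes the top-1 gain of the new MobileNet-V1 weights over the two earlier MobileNet-V1 baselines. The short source labels and the `top1` helper are my own shorthand for illustration. The values are copied straight from the list.

```python
# (model, source, top-1 %, params in millions, GMACs), copied from the list above
MOBILENETS = [
    ("MobileNet-V1 1.0", "Google original", 70.9, 4.2, 0.6),
    ("MobileNet-V2 1.0", "Google original", 71.8, 3.5, 0.3),
    ("MobileNet-V2 1.0", "timm ra_in1k", 73.0, 3.5, 0.3),
    ("MobileNet-V2 1.0", "MNV4 paper", 73.4, 3.5, 0.3),
    ("MobileNet-V4 Small", "MNV4 paper", 73.8, 3.8, 0.2),
    ("MobileNet-V4 Small", "timm e2400_r224_in1k", 73.8, 3.8, 0.2),
    ("MobileNet-V1 1.0", "MNV4 paper", 74.0, 4.2, 0.6),
    ("MobileNet-V2 1.1 w/ depth scaling", "timm ra_in1k", 75.0, 4.5, 0.4),
    ("MobileNet-V1 1.0", "this recipe", 75.4, 4.2, 0.6),
    ("MobileNet-V3 Large 1.0", "Google", 75.5, 5.5, 0.2),
    ("MobileNet-V3 Large 1.0", "timm ra_in1k", 75.8, 5.5, 0.2),
]


def top1(model, source):
    """Look up the top-1 accuracy for one (model, source) entry."""
    for name, src, acc, _params, _gmacs in MOBILENETS:
        if name == model and src == source:
            return acc
    raise KeyError((model, source))


new_v1 = top1("MobileNet-V1 1.0", "this recipe")
gain_vs_original = round(new_v1 - top1("MobileNet-V1 1.0", "Google original"), 1)
gain_vs_mnv4_paper = round(new_v1 - top1("MobileNet-V1 1.0", "MNV4 paper"), 1)

if __name__ == "__main__":
    print(f"vs original weights: +{gain_vs_original} top-1")    # +4.5
    print(f"vs MNV4 paper rerun: +{gain_vs_mnv4_paper} top-1")  # +1.4
```

Same architecture, same parameter count and GMACs in all three MobileNet-V1 rows, so the whole difference comes from the training recipe.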

I decided to give the old EfficientNet-B0 a go with these hparams. 78.6% top-1 accuracy. To put that in perspective the B0 trainings by top-1 are:

Original (Google, https://huggingface.co/timm/tf_efficientnet_b0.in1k) - 76.7
AutoAugment (Google, https://huggingface.co/timm/tf_efficientnet_b0.aa_in1k) - 77.1
AdvProp+AA (Google, https://huggingface.co/timm/tf_efficientnet_b0.ap_in1k) - 77.6
RandAugment (Me in timm, https://huggingface.co/timm/efficientnet_b0.ra_in1k) - 77.7
This MNV4 inspired recipe (https://huggingface.co/timm/efficientnet_b0.ra4_e3600_r224_in1k) - 78.6
NoisyStudent+RA (Google, https://huggingface.co/timm/tf_efficientnet_b0.ns_jft_in1k) - 78.8

So a pure ImageNet-1k training with no distillation and no extra data landed just a hair under the very impressive NoisyStudent models, which had access to unlabeled JFT data. Additionally, the OOD test set scores are holding up relative to NoisyStudent, which is also impressive. I actually think this recipe could be tweaked to push the B0 to 79%. The accuracy improvement petered out early on this run, so there is room for improvement with a tweak to the aug+reg.

What were my differences from the MobileNet-V4 hparams? Well, for one, I used timm. If you read the Supplementary Material, section A of the ResNet Strikes Back paper, I detailed a number of fixes and improvements over the default RandAugment that's used in all TensorFlow and most JAX based trainings I'm aware of. I feel some of the issues in the original are detrimental to great training. Other differences?

Repeated Augmentation (https://arxiv.org/abs/1901.09335, https://arxiv.org/abs/1902.05509)
Small probability of random Gaussian blur & random grayscale added on top of RandAugment
Random erasing w/ Gaussian noise used instead of cutout, outside of RandAugment
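The last item is easy to picture with a toy example. The sketch below is plain Python on a list-of-lists "image" and is not timm's actual implementation. The function name `erase_region` and its arguments are hypothetical. It assumes the image has already been normalized to roughly zero mean and unit variance, so standard normal noise sits in the same range as real pixels, whereas a cutout-style constant fill leaves a flat patch.

```python
import random


def erase_region(image, top, left, height, width, mode="noise", rng=random):
    """Return a copy of `image` with one rectangle overwritten.

    mode="noise": each erased pixel gets its own N(0, 1) sample (random erasing w/ Gaussian noise)
    mode="const": erased pixels are set to 0.0 (cutout-style constant fill)
    """
    out = [row[:] for row in image]  # copy so the input is left untouched
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = rng.gauss(0.0, 1.0) if mode == "noise" else 0.0
    return out


if __name__ == "__main__":
    image = [[0.5] * 6 for _ in range(6)]
    rng = random.Random(0)
    for row in erase_region(image, top=1, left=2, height=3, width=3, rng=rng):
        print(" ".join(f"{v:+.2f}" for v in row))
```

In a real pipeline the rectangle's position, area, and aspect ratio would be sampled randomly and the erase applied with some probability per image. Those details are left out here to keep the contrast between the two fill modes visible.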

So, the theme I've visited many times (ResNet Strikes Back, https://huggingface.co/collections/timm/searching-for-better-vit-baselines-663eb74f64f847d2f35a9c19, and many timm weights) continues to hold: there is a lot of wiggle room for improving old results through better training regimens.

I wonder, in 7-8 years' time, how much can be added to today's SOTA 100+B dense transformer architectures with better recipes and training techniques?
