Another approach that tries to develop a loss requiring neither tedious negative-sampling procedures nor vague architectural tricks. It falls in line with the Barlow Twins approach presented earlier by the same FAIR lab.
Proposed loss explanation
The method itself is mostly about the loss. The architecture is standard for contrastive learning: a siamese network $f_\theta$ whose heads process different views $X, X'$ of the same batch of images $I \sim D$, where the views are obtained with different random augmentations $t, t' \sim T$. The projector network $f_\phi$ is a relatively simple head, which is removed after pre-training. Since for the loss it mostly matters what the projection network outputs, I will only use the notation $Z = f_{\theta,\phi}(X); Z' = f_{\theta,\phi}(X')$.
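A minimal sketch of this setup, assuming a generic backbone that outputs `feat_dim` features per image; the class name `VICRegModel`, the dimensions, and the two-layer projector are illustrative (the paper's expander is a deeper MLP):

```python
import torch
from torch import nn

class VICRegModel(nn.Module):
    """Siamese setup: shared backbone f_theta plus projector f_phi."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 2048, proj_dim: int = 8192):
        super().__init__()
        self.backbone = backbone          # f_theta, kept after pre-training
        self.projector = nn.Sequential(   # f_phi, discarded after pre-training
            nn.Linear(feat_dim, proj_dim),
            nn.BatchNorm1d(proj_dim),
            nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, x: torch.Tensor, x_prime: torch.Tensor):
        # The same weights process both augmented views:
        # Z = f_{theta,phi}(X), Z' = f_{theta,phi}(X')
        z = self.projector(self.backbone(x))
        z_prime = self.projector(self.backbone(x_prime))
        return z, z_prime
```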
Now to the variance-invariance-covariance losses:
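For completeness, the three terms as the paper defines them, with $n$ the batch size, $d$ the embedding dimension, $\gamma = 1$ the target standard deviation, and $\epsilon$ a small constant:

$$s(Z, Z') = \frac{1}{n}\sum_{i=1}^{n} \|z_i - z'_i\|_2^2 \quad \text{(invariance)}$$

$$v(Z) = \frac{1}{d}\sum_{j=1}^{d} \max\left(0,\; \gamma - \sqrt{\mathrm{Var}(z^j) + \epsilon}\right) \quad \text{(variance)}$$

$$c(Z) = \frac{1}{d}\sum_{i \neq j} [C(Z)]_{i,j}^2, \qquad C(Z) = \frac{1}{n-1}\sum_{i=1}^{n} (z_i - \bar{z})(z_i - \bar{z})^\top \quad \text{(covariance)}$$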
The overall loss per batch is written as $\mathcal{L} = \lambda s(Z, Z') + \mu (v(Z) + v(Z')) + \nu (c(Z) + c(Z'))$. Aside from the rather large gap between the proposed hyper-parameter values $\{\nu=1; \lambda = \mu = 25 \}$, training is straightforward.
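A minimal PyTorch sketch of this loss, assuming embeddings of shape $(n, d)$; the function name `vicreg_loss` is mine, the default weights come from the hyper-parameters above, and $\gamma$, $\epsilon$ are the variance-term constants:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z, z_prime, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """Sketch of the batch loss L = lam*s + mu*(v + v') + nu*(c + c')."""
    n, d = z.shape

    # Invariance term s(Z, Z'): mean squared distance between the two views.
    inv_loss = F.mse_loss(z, z_prime)

    def variance(x):
        # Hinge keeping the per-dimension standard deviation above gamma.
        std = torch.sqrt(x.var(dim=0) + eps)
        return F.relu(gamma - std).mean()

    def covariance(x):
        # Penalise off-diagonal entries of the embedding covariance matrix.
        x = x - x.mean(dim=0)
        cov = (x.T @ x) / (n - 1)
        off_diag = cov.pow(2).sum() - cov.diagonal().pow(2).sum()
        return off_diag / d

    var_loss = variance(z) + variance(z_prime)
    cov_loss = covariance(z) + covariance(z_prime)

    return lam * inv_loss + mu * var_loss + nu * cov_loss
```

The two outputs of the siamese sketch above feed straight into this function: `loss = vicreg_loss(*model(x, x_prime))`.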
The authors pretrained their network on ImageNet ILSVRC-2012 and validated the method on a range of supervised downstream datasets, providing evidence that it is roughly on par with SoTA methods, without requiring tricky training procedures.
Linear evaluation on ImageNet classification
Results on different datasets: linear evaluation for classification and fine-tuning for detection.
In the ablation study the authors show that their method does not require normalisation, which had been a usual part of self-supervised pre-training in a variety of earlier methods.