Another approach that tries to develop losses requiring neither tedious negative-sampling procedures nor obscure architectural tricks. It falls in line with the Barlow Twins approach presented earlier by the same FAIR lab.

Method

Proposed loss explanation

The method itself is mostly about the losses. The architecture is standard for contrastive learning: a Siamese network $f_\theta$ processes two different views $X, X'$ of the same batch of images $I \sim D$, where the views are obtained with different random augmentations $t, t' \sim T$. The projector network $f_\phi$ is a relatively simple head, which is removed after pre-training. Since the losses are defined on the outputs of the projection network, I will only use the notation $Z = f_{\theta,\phi}(X); Z' = f_{\theta,\phi}(X')$.
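For concreteness, here is a minimal PyTorch sketch of this setup. The ResNet-50 backbone and the 8192-wide expander MLP are my assumptions for illustration, not a verbatim reproduction of the authors' code:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Shared backbone f_theta; the classification head is replaced so it
# outputs the raw 2048-d features.
backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()

# Projector f_phi: a simple expander MLP, discarded after pre-training.
projector = nn.Sequential(
    nn.Linear(2048, 8192), nn.BatchNorm1d(8192), nn.ReLU(),
    nn.Linear(8192, 8192), nn.BatchNorm1d(8192), nn.ReLU(),
    nn.Linear(8192, 8192),
)

def embed(x):
    """Z = f_{theta,phi}(X): backbone features passed through the projector."""
    return projector(backbone(x))

# x and x_prime stand for the two augmented views t(I) and t'(I) of the same batch.
x = torch.randn(4, 3, 224, 224)
x_prime = torch.randn(4, 3, 224, 224)
z, z_prime = embed(x), embed(x_prime)  # Z and Z' used by all loss terms below
```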

Now to the variance-invariance-covariance losses (all three are sketched in code after the list):

  1. Variance — enforces the standard deviation of the predicted embeddings along each axis to be no less than a threshold value $\gamma$: $v(Z) = \frac{1}{d} \sum \limits_{j=1}^{d} \max(0, \gamma - \sqrt{\mathrm{Var}(Z_{*,j}) + \epsilon})$. The authors point out that it is important to use the standard deviation rather than the variance itself: the variance depends quadratically on the average distance between samples, and therefore fails to provide useful gradients in the most important regime, close to collapse.
  2. Invariance — the classical contrastive learning loss, though without any reliance on negative samples. It only requires embeddings of different views of the same image to be as close as possible: $s(Z,Z') = \frac{1}{n} \sum \limits_i \Vert Z_i - Z_i' \Vert_2^2$.
  3. Covariance — a redundancy reduction loss inspired by the Barlow Twins method. It reduces the redundancy of the encoded information by decorrelating different dimensions of the learned representation. If we define the covariance matrix as $C(Z) = \frac{1}{n-1} \sum \limits_i (Z_i - \bar Z)(Z_i - \bar Z)^T$, we can then define a loss that pushes the covariance between different dimensions towards zero: $c(Z) = \frac{1}{d} \sum \limits_{i \neq j} C(Z)^2_{i,j}$.

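The three terms transcribe almost directly into PyTorch. This is a sketch, not the authors' reference implementation; the function names are mine, and $\gamma = 1$, $\epsilon = 10^{-4}$ are assumed defaults:

```python
import torch

def variance_loss(z, gamma=1.0, eps=1e-4):
    # v(Z): hinge on the per-dimension standard deviation of the batch.
    std = torch.sqrt(z.var(dim=0) + eps)
    return torch.relu(gamma - std).mean()

def invariance_loss(z, z_prime):
    # s(Z, Z'): mean squared distance between embeddings of the two views.
    return (z - z_prime).pow(2).sum(dim=1).mean()

def covariance_loss(z):
    # c(Z): sum of squared off-diagonal entries of the covariance matrix,
    # divided by the embedding dimension d.
    n, d = z.shape
    z = z - z.mean(dim=0)
    cov = (z.T @ z) / (n - 1)
    off_diag = cov.pow(2).sum() - cov.pow(2).diagonal().sum()
    return off_diag / d
```
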
The overall loss per batch is then written as $\mathcal{L} = \lambda s(Z, Z') + \mu (v(Z) + v(Z')) + \nu (c(Z) + c(Z'))$. Aside from the rather large spread between the proposed hyper-parameter values $\{\nu=1; \lambda = \mu = 25 \}$, training is straightforward.
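
Assuming the loss functions from the sketch above, the combined objective is a straightforward weighted sum:

```python
def vicreg_loss(z, z_prime, lam=25.0, mu=25.0, nu=1.0):
    # L = lambda * s(Z, Z') + mu * (v(Z) + v(Z')) + nu * (c(Z) + c(Z'))
    return (lam * invariance_loss(z, z_prime)
            + mu * (variance_loss(z) + variance_loss(z_prime))
            + nu * (covariance_loss(z) + covariance_loss(z_prime)))
```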

Results

The authors pre-trained their network on ImageNet ILSVRC-2012 and validated the method on different supervised datasets, providing evidence that it is roughly on par with SoTA methods without requiring tricky training procedures.

Linear evaluation on ImageNet classification

Results on different datasets: linear evaluation for classification and fine-tuning for detection.

In the ablation study the authors show that their method does not require normalisation, which was a usual part of self-supervised pre-training in a variety of earlier methods.