In this paper, the authors adapt Class Activation Maps (CAM) to self-supervised losses, creating ContraCAM, which enables unsupervised object localization with a network trained purely by self-supervision. Building on this localization, the authors propose two augmentations that mitigate two typical biases of self-supervised networks. They reduce the network's over-reliance on co-occurring objects by guiding the random crop augmentation to include only a single object (object-aware random crop), and they reduce its over-reliance on an object's background by soft-mixing random backgrounds into images (background mixup). Both augmentations require object localization, but since the authors obtain this localization in a self-supervised manner, the method as a whole remains self-supervised.

ContraCAM

The authors start from CAM and make two important substitutions: (1) they replace the classification loss with a self-supervised contrastive loss, and (2) they discard negative gradients. While the first change raises no questions, the second is opaque yet practically well supported: dropping negative gradients brings little to no benefit in the supervised case, but greatly improves quality in this self-supervised one.

$$ ContraCAM_{ij}(x) = Normalize(ReLU(\sum \limits_k \alpha_k A_{ij}^k(x))) $$

$$ \alpha_k = ReLU\left(\frac{1}{HW} \sum \limits_{ij} \frac{\partial \mathcal{L}_{\theta}(x|x, \mathcal{B})}{\partial A_{ij}^k(x)}\right) $$

The main differences lie in the second line: the authors add a ReLU there and replace the supervised classification loss with a self-supervised contrastive loss, using the sample itself as its own positive counterpart. Here $A_{ij}^k(x)$ is the activation at spatial position $(i, j)$ of the $k$-th channel of some (typically penultimate) layer of the network, and $\mathcal{L}_{\theta}(x|x, \mathcal{B})$ is the contrastive loss of sample $x$ against the batch $\mathcal{B}$.
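The weighting step above can be sketched in NumPy, assuming the activations $A$ and the gradients of the contrastive loss w.r.t. them have already been extracted (e.g. via autograd hooks). The function name `contracam` and the `(K, H, W)` tensor layout are illustrative choices, not the authors' code:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def contracam(activations, gradients, eps=1e-8):
    """ContraCAM-style saliency map from penultimate-layer activations
    A (K, H, W) and the gradients dL/dA (K, H, W) of the contrastive
    loss, which are assumed to come from autograd.

    alpha_k = ReLU(mean_ij dL/dA_ij^k)      # negative gradients dropped
    CAM     = Normalize(ReLU(sum_k alpha_k * A^k))
    """
    # Channel weights: spatial average of gradients, negatives clipped.
    alpha = relu(gradients.mean(axis=(1, 2)))             # (K,)
    # Weighted sum over channels, then ReLU.
    cam = relu(np.tensordot(alpha, activations, axes=1))  # (H, W)
    # Min-max normalize to [0, 1].
    return (cam - cam.min()) / (cam.max() - cam.min() + eps)
```

The only departure from the classic Grad-CAM recipe is the extra ReLU on `alpha`, which is exactly the negative-gradient removal discussed above.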

To further improve these masks, the authors propose an iterative procedure: at step $t$ the new input is formed as $x^t = (1-\overline{ContraCAM}(x^{t-1})) \circ x$, where $\overline{ContraCAM}$ denotes aggregation by taking, for each pixel, the maximum over all previous iterations. This way the map expands from the most discriminative areas to more subtle ones, covering the object better. The procedure runs for all samples in the batch simultaneously.
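The iterative expansion can be sketched as follows; `cam_fn` here stands in for any single-pass saliency function (such as the ContraCAM of a trained encoder) and is an assumption of this sketch, not part of the paper's code:

```python
import numpy as np

def iterative_contracam(x, cam_fn, n_iters=3):
    """Iteratively expand a saliency mask (a sketch; `cam_fn` maps an
    image array to a saliency map in [0, 1] of the same shape).

    At step t the aggregated mask is the pixel-wise maximum over all
    maps seen so far, and the salient regions are soft-erased from the
    input so the next pass highlights new, less discriminative parts.
    """
    agg = np.zeros_like(x)
    x_t = x
    for _ in range(n_iters):
        agg = np.maximum(agg, cam_fn(x_t))  # pixel-wise max over iterations
        x_t = (1.0 - agg) * x               # mask out already-found regions
    return agg
```

Because the aggregation is a running maximum, the mask can only grow across iterations, matching the intent of covering the object more completely.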


In their experiments, the authors precompute all CAMs with a trained network rather than computing them on the fly; after this precomputation they run a new training with the proposed augmentations.

Object-aware Random Crop

For each image, the authors first localize the objects. When there is more than one object, they constrain the random crops so that each cropped image always contains only one of them.

Background Mixup

The authors perform a soft mixup via $x' = ContraCAM(x) \circ x + (1-ContraCAM(x)) \circ bg$, where $bg$ is collected from the same dataset by cropping out the objects (again guided by ContraCAM) and tiling the remaining background pieces.
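The mixing itself is a one-liner; this sketch assumes the mask has already been resized to the image resolution and is broadcastable over the channel dimension:

```python
import numpy as np

def background_mixup(x, mask, bg):
    """Soft-mix a random background into an image (a sketch; `mask` is
    a ContraCAM map with values in [0, 1] and `bg` is an object-free
    background tile with the same shape as `x`).

    x' = mask * x + (1 - mask) * bg
    """
    return mask * x + (1.0 - mask) * bg
```

With `mask == 1` everywhere the image is untouched, and with `mask == 0` it is fully replaced by the background, so the object stays intact while its context is randomized.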


Experiments

The authors show qualitatively that ContraCAM works well on different object-centric datasets, which is a strong result given the absence of supervision.


The authors also show a qualitative overview of the benefits of removing negative gradients.


They also report quantitative results for unsupervised object localization: not only does the method achieve SoTA results among unsupervised localization approaches, it is on par with or better than even the supervised CAM produced by a trained classifier.