The paper proposes a versatile two-step approach to training pixel-level embeddings that can be used both for unsupervised segmentation and as pre-training for semi-supervised segmentation. The authors argue that a mid-level prior for training the embeddings is a better option than a global one (e.g. image-level contrastive learning), which tends to fail on images containing more than one object, and than a local one (e.g. image colorisation), which tends to produce over-segmentation because it attends to overly local features.

Method

The proposed approach consists of two steps: mining salient object mask proposals, followed by pixel-level contrastive learning.

To mine object mask proposals, the authors employ saliency estimation. It can be either supervised or unsupervised, depending on the target of the pipeline and the available information. In the reported experiments, BASNet and USPS were used for the supervised and unsupervised proposals respectively, although the authors note that more advanced methods for generating mask proposals could be beneficial.
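A minimal sketch of this mining step is shown below. `saliency_net` stands in for any pre-trained saliency model (e.g. BASNet); its exact API and the 0.5 threshold are assumptions for illustration, not the authors' code.

```python
import torch

@torch.no_grad()
def mine_object_mask(image, saliency_net, threshold=0.5):
    """image: (3, H, W) tensor; returns a binary (H, W) object mask proposal."""
    # The saliency model is assumed to output per-pixel saliency scores in [0, 1].
    saliency = saliency_net(image.unsqueeze(0)).squeeze()
    # Binarise the saliency map into a mask proposal for contrastive training.
    return (saliency > threshold).float()
```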

To learn pixel embeddings, the authors propose to maximise the agreement between a pixel embedding and the mean embedding of the object that contains this pixel while, to avoid mode collapse, minimising the agreement between the pixel embedding and the mean embeddings of other objects. If we denote the output of the pixel-embedding network as $\mathbf{z}$ and the set of pixel indices belonging to the $j$-th object as $\mu_j$, the loss can be written as $\mathcal{L} = -\log \frac{\exp(\mathbf{z}_i \cdot \mathbf{z}_{\mu_p} / \tau)}{\exp(\mathbf{z}_i \cdot \mathbf{z}_{\mu_p} / \tau) + \sum\limits_{n \ne p} \exp(\mathbf{z}_i \cdot \mathbf{z}_{\mu_n} / \tau)}$, where $i \in \mu_p$ and $\mathbf{z}_{\mu_j}$ is the average embedding inside the mask of object $j$.
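A compact PyTorch sketch of this loss is given below. The shapes, the mask format, and the function name are assumptions; note also that in the paper the negatives are typically mean embeddings of masks from other images in the batch, whereas here, for compactness, all masks come from a single image.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(z, masks, tau=0.5):
    """z: (D, H, W) pixel embeddings; masks: (K, H, W) binary object masks."""
    D = z.shape[0]
    z = F.normalize(z.view(D, -1), dim=0)            # (D, HW), unit-norm pixel embeddings
    m = masks.view(masks.shape[0], -1).float()       # (K, HW)
    # Mean embedding of each object mask, re-normalised onto the unit sphere.
    mean = F.normalize((m @ z.t()) / m.sum(dim=1, keepdim=True).clamp(min=1), dim=1)
    logits = (mean @ z) / tau                        # (K, HW): pixel-to-mask-mean similarities
    loss, count = 0.0, 0
    for p in range(masks.shape[0]):
        idx = m[p].bool()
        if idx.any():
            # For pixels of object p, mask mean p is the positive; all other means are negatives.
            targets = torch.full((int(idx.sum()),), p, dtype=torch.long)
            loss = loss + F.cross_entropy(logits[:, idx].t(), targets)
            count += 1
    return loss / max(count, 1)
```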

Figure: pixel embedding training procedure

Experiments

The authors report state-of-the-art results on the PASCAL VOC dataset across different versions of the embedding pre-training and different evaluations of those embeddings (both a linear model on top of the frozen embeddings and K-Means clustering were tested).
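The K-Means protocol can be sketched as below: cluster the frozen pixel embeddings and match clusters to ground-truth classes with Hungarian matching. The details here are assumptions (the paper reports mIoU rather than the pixel accuracy computed in this sketch).

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def kmeans_eval(pixel_embeddings, pixel_labels, n_classes):
    """pixel_embeddings: (N, D) frozen embeddings; pixel_labels: (N,) ground-truth ids."""
    clusters = KMeans(n_clusters=n_classes, n_init=10).fit_predict(pixel_embeddings)
    # Cluster/class co-occurrence matrix, then optimal one-to-one matching.
    cooc = np.zeros((n_classes, n_classes))
    for c, y in zip(clusters, pixel_labels):
        cooc[c, y] += 1
    row, col = linear_sum_assignment(-cooc)          # maximise matched pixel count
    mapping = dict(zip(row, col))
    preds = np.array([mapping[c] for c in clusters])
    return (preds == pixel_labels).mean()            # pixel accuracy after matching
```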

In a separate setup, the authors analysed how well the trained model can be fine-tuned on a fraction of the PASCAL dataset to perform segmentation at inference. This evaluation shows an improvement over naïve ImageNet initialisation even when 100% of the training data is used.
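A minimal sketch of this semi-supervised setup: a segmentation model whose backbone is initialised from the contrastively pre-trained weights and then fine-tuned on the labelled fraction. The checkpoint path and the choice of DeepLabV3 are illustrative assumptions, not the authors' exact configuration.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(num_classes=21)                 # 21 PASCAL VOC classes
state = torch.load("pretrained_pixel_embeddings.pth")      # hypothetical checkpoint
model.backbone.load_state_dict(state, strict=False)        # reuse pre-trained backbone only
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def finetune_step(images, labels):
    """One supervised step on a (fraction of the) labelled training set."""
    optimizer.zero_grad()
    logits = model(images)["out"]                          # (B, 21, H, W)
    loss = torch.nn.functional.cross_entropy(logits, labels, ignore_index=255)
    loss.backward()
    optimizer.step()
    return loss.item()
```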

Perhaps even more importantly, the authors thoroughly analysed the impact of different design choices. As one example that needs little explanation, different mask proposal mechanisms can be compared: