On the one hand, the proposed approach falls in line with many clustering-based approaches to semantic segmentation training (e.g. DeepCluster). Unlike these approaches, however, the authors propose not to rely solely on iterative clustering refinement. They also borrow an idea from equivariance learning (which makes their work similar to VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning, among others): pixel embeddings should not only be clustered together when they belong to the same class, they should also be invariant to color transformations and equivariant to spatial transformations.
Assume we have a neural network that produces pixel-wise embeddings for the input image, $\phi = f_\theta(x)$. We then define a loss that clusters these embeddings, attracting each embedding to its closest centroid:
$$ \mathcal{L}_{clust}(\phi, \mu) = -\sum_i \log \frac{e^{-d_{\cos}(\phi_i, \tilde{\mu}_i)}}{\sum\limits_{\nu \in \mu} e^{-d_{\cos}(\phi_i, \nu)}}; \qquad \tilde{\mu}_i = \underset{\nu \in \mu}{\arg\min}\,\Vert \phi_i - \nu \Vert $$
where $\mu$ is the set of clustering centroids (whose origin we will clarify shortly) and $d_{\cos}(\cdot, \cdot)$ denotes cosine distance. This part is more or less standard for all such works.
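A minimal PyTorch-style sketch of this per-pixel clustering loss might look as follows (the function and variable names are my own, not from the paper); it treats the negative cosine distances to the centroids as logits and applies cross-entropy against the nearest-centroid assignment:

```python
import torch
import torch.nn.functional as F

def clustering_loss(phi, centroids):
    """Cross-entropy loss that attracts each pixel embedding to its nearest centroid.

    phi:       (N, D) pixel embeddings (flattened over batch and spatial dims)
    centroids: (K, D) cluster centroids
    """
    # Cosine distance between every embedding and every centroid: (N, K)
    phi_n = F.normalize(phi, dim=1)
    mu_n = F.normalize(centroids, dim=1)
    dist = 1.0 - phi_n @ mu_n.t()

    # Pseudo-label = index of the closest centroid (no gradient through the assignment)
    labels = torch.cdist(phi, centroids).argmin(dim=1).detach()

    # Softmax over negative distances, cross-entropy against the pseudo-labels
    return F.cross_entropy(-dist, labels)
```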
Now, the authors propose to add an inductive bias to this loss. More precisely, they require the embeddings to be invariant to photometric augmentations $P(x)$ (which change colors but not pixel positions) and equivariant to geometric transformations $G(x)$ (which change pixel positions). First, suppose we compute two versions of the embeddings for one image, $\phi_1 = f_\theta(G(P_1(x)))$ and $\phi_2 = G(f_\theta(P_2(x)))$, and then cluster each of them: $\mu_1 = KMeans(\phi_1)$, $\mu_2 = KMeans(\phi_2)$. Note that while $P_1$ and $P_2$ are sampled separately, $G$ is sampled only once; therefore, if the representations are equivariant to $G$ and invariant to $P$, the two views should be equal. To enforce consistent clustering across these representations, the authors construct the final loss as follows:
$$ \mathcal{L}_{within} = \mathcal{L}_{clust}(\phi_1, \mu_1) + \mathcal{L}_{clust}(\phi_2, \mu_2) \\ \mathcal{L}_{cross} = \mathcal{L}_{clust}(\phi_1, \mu_2) + \mathcal{L}_{clust}(\phi_2, \mu_1) \\ \mathcal{L}_{total} = \mathcal{L}_{within} + \mathcal{L}_{cross} $$
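A sketch of how the two views and the cross-view losses could be wired together, reusing `clustering_loss` from above. It assumes a `kmeans` helper that returns `(K, D)` centroids and that `G` is a spatial transform (e.g. crop/flip) applicable both to images and to feature maps; all names here are illustrative, not the authors' code:

```python
def two_view_losses(f_theta, x, P1, P2, G, kmeans):
    """Within-view and cross-view clustering losses over two augmented views."""
    # View 1: photometric + geometric on the image, then encode
    phi1 = f_theta(G(P1(x)))                    # (B, D, H, W)
    # View 2: photometric on the image, encode, then geometric on the features
    phi2 = G(f_theta(P2(x)))                    # (B, D, H, W)

    # Flatten to (N, D) pixel embeddings and cluster each view separately
    flat1 = phi1.permute(0, 2, 3, 1).reshape(-1, phi1.shape[1])
    flat2 = phi2.permute(0, 2, 3, 1).reshape(-1, phi2.shape[1])
    mu1 = kmeans(flat1.detach())                # (K, D) centroids of view 1
    mu2 = kmeans(flat2.detach())                # (K, D) centroids of view 2

    # Within-view terms attract each view to its own centroids,
    # cross-view terms attract it to the other view's centroids
    l_within = clustering_loss(flat1, mu1) + clustering_loss(flat2, mu2)
    l_cross  = clustering_loss(flat1, mu2) + clustering_loss(flat2, mu1)
    return l_within + l_cross
```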
In this way, representations that are clustered together in one view are pushed to be clustered together in the other view as well. The authors also propose a way to balance the loss with respect to how many clusters are used:
$$ \mathcal{L} = \lambda_{K_1}\mathcal{L}_{K_1} + \lambda_{K_2}\mathcal{L}_{K_2}; \qquad \lambda_{K_1}=\frac{\log K_2}{\log K_1 + \log K_2}; \quad \lambda_{K_2}=\frac{\log K_1}{\log K_1 + \log K_2} $$
where $\mathcal{L}_{K_i}$ is any component $\mathcal{L}_{clust}(\circ,\mu_i)$ computed with $K_i$ centroids.
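Since the balancing weights depend only on the two cluster counts, the combination could be as simple as the following sketch (again with assumed names):

```python
import math

def balanced_loss(loss_k1, loss_k2, K1, K2):
    """Weight each loss by the log of the *other* cluster count,
    so the term with more clusters does not dominate."""
    denom = math.log(K1) + math.log(K2)
    lam1 = math.log(K2) / denom
    lam2 = math.log(K1) / denom
    return lam1 * loss_k1 + lam2 * loss_k2
```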
The authors show qualitatively acceptable results for their algorithm, as well as SoTA results compared to other unsupervised segmentation methods.
In the ablation study, they also show that their method can be employed as a pre-training step.
Another part of the ablation study suggests that the most important part of the loss is the term related to equivariance.