In this paper, the authors draw the intuition for their new loss from the way histopathologists work with images. Because of the enormous size of histopathology images, they are stored in a pyramid-like structure with multiple zoom levels, so researchers tend to zoom in and out many times while examining an image. The paper proposes two new things: 1) a self-supervised pre-training loss and 2) a semi-supervised teacher-student training paradigm to further train the network in a low-data regime.

Pipeline

Three-step pipeline proposed in this paper

The authors propose a pipeline consisting of three steps (a sketch of how they chain together follows the list):

  1. Self-supervised pre-training on the unlabeled dataset $D_u$.
  2. Task-specific supervised fine-tuning on the labeled dataset $D_l$.
  3. Consistency training of the pre-trained network on $D_u \cup D_l$.
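To make the flow concrete, here is a minimal sketch of how the three stages might be chained in code. The stage functions (`pretrain_self_supervised`, `finetune_supervised`, `consistency_train`) are hypothetical placeholders standing in for the procedures described below, not code from the paper.

```python
from torch import nn
from torch.utils.data import ConcatDataset, Dataset

# NOTE: pretrain_self_supervised, finetune_supervised and consistency_train
# are hypothetical placeholders for the three stages described in this post.


def run_pipeline(backbone: nn.Module, D_u: Dataset, D_l: Dataset) -> nn.Module:
    # 1. Self-supervised pre-training on the unlabeled dataset D_u
    #    (the magnification-order loss described in the next section).
    backbone = pretrain_self_supervised(backbone, D_u)
    # 2. Task-specific supervised fine-tuning on the labeled dataset D_l.
    model = finetune_supervised(backbone, D_l)
    # 3. Teacher-student consistency training on the union of both sets.
    model = consistency_train(model, ConcatDataset([D_u, D_l]))
    return model
```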

Since step 2 is standard supervised fine-tuning, we will dive into steps 1 and 3.

Self-supervised pre-training loss

The authors propose a novel self-supervised loss, along with a specific model architecture to train with it.

Additional pairwise feature extraction head on top of the network for easier training with this loss

Suppose we have $K$ possible magnification factors of the image ($K$ itself is arbitrary, though the authors used $K=3$). We then enumerate all $K!$ possible permutations of the magnification factors and denote them as $\{\pi_i\}_{i \in [1,K!]}$.
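For instance, with $K=3$ there are $3!=6$ permutations. A quick way to enumerate them (the magnification values below are illustrative, not necessarily the ones used in the paper):

```python
from itertools import permutations

# Illustrative magnification factors for K = 3 (example values only).
factors = [5, 10, 20]

# All K! = 6 orderings; the permutation index k is what the network
# will be trained to predict.
pi = list(permutations(factors))
for k, perm in enumerate(pi, start=1):
    print(k, perm)
# 1 (5, 10, 20)
# 2 (5, 20, 10)
# ...
# 6 (20, 10, 5)
```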

Then the training procedure is as follows (a PyTorch sketch of the architecture follows the list):

  1. Select a random spatial point on the image and a random magnification permutation index $k \in [1,K!]$.
  2. Generate a set of concentric crops around the selected point, denoted $\{s_i\}_{i \in [1,K]}$, where each $s_i$ is cropped from the image magnified with factor $\pi_{k,i}$.
  3. Forward-pass each crop through the feature extraction network $f_{\theta}$, which can be any architecture that produces a vector representation of an image (the authors employed a ResNet-18 for this task).
  4. Construct the set of all (non-permuted (why, though?)) concatenated pairs of representations $\{f_\theta(s_i)\circ f_\theta(s_j)\}_{i\in[1,K),\, j\in(i,K]}$ and forward-pass each through $f_\phi$, a simple MLP that mixes each pair and represents the relation between the two representations.
  5. Concatenate all paired representations from the previous step, $f_\phi(f_\theta(s_1)\circ f_\theta(s_2))\circ \ldots \circ f_\phi(f_\theta(s_{K-1}) \circ f_\theta(s_K))$, and pass the result through a final classification head $f_\xi$ with $K!$ outputs.
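A possible PyTorch sketch of steps 3 to 5, assuming a ResNet-18 backbone for $f_\theta$, a one-layer MLP for $f_\phi$, and a linear layer for $f_\xi$; the feature and hidden dimensions are illustrative guesses, not values reported in the paper.

```python
import torch
import torch.nn as nn
from itertools import combinations
from math import factorial
from torchvision.models import resnet18


class MagnificationOrderNet(nn.Module):
    """Sketch of the described architecture: a shared feature extractor f_theta,
    a pair-wise mixing MLP f_phi, and a classification head f_xi over the K!
    permutation classes. Dimensions are illustrative, not the paper's exact ones."""

    def __init__(self, K: int = 3, feat_dim: int = 512, pair_dim: int = 128):
        super().__init__()
        self.K = K
        # f_theta: ResNet-18 backbone with its classifier removed.
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.f_theta = backbone
        # f_phi: simple MLP mixing a concatenated pair of representations.
        self.f_phi = nn.Sequential(
            nn.Linear(2 * feat_dim, pair_dim), nn.ReLU(inplace=True)
        )
        # f_xi: classification head over all K! permutations.
        n_pairs = K * (K - 1) // 2
        self.f_xi = nn.Linear(n_pairs * pair_dim, factorial(K))

    def forward(self, crops: torch.Tensor) -> torch.Tensor:
        # crops: (B, K, 3, H, W) -- the K concentric crops of one sample.
        B, K = crops.shape[:2]
        feats = self.f_theta(crops.flatten(0, 1)).view(B, K, -1)  # f_theta(s_i)
        # Pair-wise representations f_phi(f_theta(s_i) ∘ f_theta(s_j)) for i < j.
        pairs = [
            self.f_phi(torch.cat([feats[:, i], feats[:, j]], dim=1))
            for i, j in combinations(range(K), 2)
        ]
        # Concatenate all pair representations and predict the permutation id.
        return self.f_xi(torch.cat(pairs, dim=1))
```

The sketch assumes all $K$ crops are resized to a common resolution so they can be stacked into a single tensor.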

All networks are then trained simultaneously by minimising the cross-entropy loss $\mathrm{CE}(f_{\theta,\phi,\xi}(\{s_i\}), k)$.
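A minimal training step matching this objective, reusing the `MagnificationOrderNet` sketch above; the optimizer choice and learning rate are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

model = MagnificationOrderNet(K=3)
# One optimizer over all parameters trains f_theta, f_phi and f_xi jointly
# (Adam and lr=1e-3 are placeholders, not from the paper).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


def training_step(crops: torch.Tensor, k: torch.Tensor) -> float:
    """crops: (B, K, 3, H, W) concentric crops ordered by the sampled
    permutation pi_k; k: (B,) permutation indices (0-based) to predict."""
    logits = model(crops)              # f_{theta,phi,xi}({s_i})
    loss = F.cross_entropy(logits, k)  # CE(f(...), k)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```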

The authors argue that, in order to recover the order of the magnification factors, $f_{\theta}$ has to learn meaningful representations.