¹University of Maryland, Baltimore County, ²University of California, Davis
∗ denotes equal contribution
Most recent self-supervised learning (SSL) algorithms learn features either by contrasting individual image instances or by clustering the images and then contrasting the clusters. We introduce a simple mean-shift algorithm that learns representations by grouping images together, without contrasting between them and without imposing a strong prior on the structure of the clusters. We simply "shift" the embedding of each image to be close to the "mean" of its nearest neighbors. Since in our setting the closest neighbor is always another augmentation of the same image, our model is identical to BYOL when it uses only one nearest neighbor instead of the 5 used in our experiments. Our model achieves 72.4% on ImageNet linear evaluation with ResNet50 at 200 epochs, outperforming BYOL.
We introduce a simple but effective mean-shift algorithm that groups similar images together in the neighborhood of each image in an online fashion. The idea is to find the nearest neighbors of a query image in the embedding space and pull the query's embedding closer to the center of those neighbors. We believe this process develops clusters of images in the embedding space without enforcing strong constraints on their size, number, or shape. Note that, in contrast to the grouping (pulling) in our method, MoCo pushes the query away from all other data points, particularly its nearest neighbors, which dominate the loss.
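The neighbor lookup and the pull toward the neighbors' center can be illustrated with a minimal sketch. It assumes L2-normalized embeddings and squared Euclidean distance, and it moves the query vector directly, whereas the method itself moves embeddings by training the encoder. The names `nearest_neighbors`, `mean_shift_step`, and the `step` parameter are hypothetical and chosen only for this illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumed normalization, see lead-in)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbors(query, bank, k):
    """Indices of the k bank entries closest to the query."""
    order = sorted(range(len(bank)), key=lambda i: sq_dist(query, bank[i]))
    return order[:k]

def mean_shift_step(query, bank, k, step=0.5):
    """Move the query a fraction `step` of the way toward the mean of its
    k nearest neighbors, then re-normalize. Illustration only."""
    idx = nearest_neighbors(query, bank, k)
    mean = [sum(bank[i][d] for i in idx) / k for d in range(len(query))]
    shifted = [q + step * (m - q) for q, m in zip(query, mean)]
    return l2_normalize(shifted), idx

# Toy 2-D embedding space: three nearby points and two far away.
bank = [l2_normalize(v) for v in
        [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [-1.0, 0.0], [0.0, -1.0]]]
query = l2_normalize([0.5, 0.5])
shifted, idx = mean_shift_step(query, bank, k=3)
```

After the step, the summed distance from the query to its three neighbors is smaller than before, which is the grouping effect described above. Nothing in the sketch pushes the query away from the remaining points.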
For two random query images, we show how the nearest neighbors evolve during training. Initially, the NNs are not semantically related to the query, but they are close in low-level features. The accuracy of a 1-NN classifier at initialization is 1.5%, which is 15 times larger than random chance (0.1%). This small signal is bootstrapped by our learning method, and in late epochs the NNs are mostly semantically related to the query image.
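The chance-level comparison can be checked with a short sketch. With 1000 ImageNet classes, uniform guessing gives 1/1000 = 0.1%, and 1.5% is 15 times that. The 1-NN classifier below is a generic illustration on toy features, not the evaluation code used for the reported numbers, and all function names are hypothetical.

```python
def chance_accuracy_percent(num_classes):
    """Accuracy (in percent) of uniform random guessing over balanced classes."""
    return 100.0 / num_classes

def one_nn_predict(query, feats, labels):
    """Label of the stored feature closest to the query (squared Euclidean)."""
    best = min(range(len(feats)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(query, feats[i])))
    return labels[best]

def one_nn_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Percentage of test features whose nearest training feature has the same label."""
    hits = sum(one_nn_predict(f, train_feats, train_labels) == y
               for f, y in zip(test_feats, test_labels))
    return 100.0 * hits / len(test_feats)

chance = chance_accuracy_percent(1000)   # ImageNet has 1000 classes
ratio = 1.5 / chance                     # reported initial 1-NN accuracy vs. chance
```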
Similar to BYOL, we maintain two encoders ("target" and "online") and update the target encoder with a momentum update. We augment an image twice and feed the two views to the two encoders. We add the target embedding to the memory bank and look for its nearest neighbors in the memory bank. The target embedding itself is trivially the first nearest neighbor. We want to shift the embedding of the query image towards the mean of the target's nearest neighbors, so we minimize the sum of the distances between the query embedding and those neighbors. Note that with only one nearest neighbor our method is identical to BYOL, which pulls different augmentations together without grouping different image instances. To our knowledge, our method is the first to group different image instances without contrasting between image instances or clusters.
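The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are taken to be L2-normalized, the distance is assumed to be squared Euclidean, the momentum value `m` is a placeholder, and the names `momentum_update` and `msf_loss` are hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def momentum_update(target_params, online_params, m=0.99):
    """Exponential moving average: target <- m * target + (1 - m) * online.
    The value of m is an assumed placeholder."""
    return [m * t + (1 - m) * o for t, o in zip(target_params, online_params)]

def msf_loss(query_emb, target_emb, memory_bank, k):
    """Sum of distances between the query embedding and the k nearest
    neighbors of the target embedding. The target is added to the bank
    first, so it is always its own first neighbor (distance 0)."""
    bank = [target_emb] + memory_bank
    order = sorted(range(len(bank)), key=lambda i: sq_dist(target_emb, bank[i]))
    neighbors = [bank[i] for i in order[:k]]
    return sum(sq_dist(query_emb, n) for n in neighbors)

# Toy example: q and t stand for the embeddings of two augmentations of one image.
q = l2_normalize([1.0, 0.2])
t = l2_normalize([1.0, 0.0])
bank = [l2_normalize([0.9, 0.1]), l2_normalize([-1.0, 0.0])]
loss_k1 = msf_loss(q, t, bank, k=1)  # only the other augmentation: BYOL-like
loss_k2 = msf_loss(q, t, bank, k=2)  # also pulls toward another instance
```

With `k=1` the only neighbor is the target embedding itself, so the loss reduces to the distance between the two augmentations, which matches the BYOL special case noted above.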
We compare our model on the full ImageNet linear and nearest neighbor benchmarks using ResNet50. We find that, given a similar computational budget, our models are better than other state-of-the-art methods. Our w/s variation works slightly better than the regular MSF. Note that methods with a symmetric loss are not directly comparable with the others, as they need to feed each image twice through each encoder. This results in twice the computation for each mini-batch. One may argue that a non-symmetric BYOL with 200 epochs should be compared only with a symmetric BYOL with 100 epochs, as they use a similar amount of computation. Note that symmetric MoCo v2 with 400 epochs performs almost the same as asymmetric MoCo v2 with 800 epochs (71.0 vs. 71.1).
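The compute-matching argument is simple arithmetic and can be written out as a small sketch. It assumes cost is proportional to the number of epochs and that a symmetric loss doubles the per-mini-batch cost, as stated above. The function name `relative_compute` is hypothetical.

```python
def relative_compute(epochs, symmetric):
    """Training cost in units of one asymmetric epoch.
    Assumes a symmetric loss doubles the computation per mini-batch."""
    return epochs * (2 if symmetric else 1)

byol_sym_100 = relative_compute(100, symmetric=True)
byol_asym_200 = relative_compute(200, symmetric=False)
moco_sym_400 = relative_compute(400, symmetric=True)
moco_asym_800 = relative_compute(800, symmetric=False)
```

Under this accounting each symmetric run costs the same as the asymmetric run with twice as many epochs, which is why those pairs are the fair comparison.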
We compare our model on the ImageNet 1% and 10% linear evaluation benchmarks for ResNet50. The column "Fine-tuned" refers to whether the full network was fine-tuned or a single linear layer was trained. Given similar computational budgets, both of our models are better than other state-of-the-art methods.
We compare various SSL methods on transfer tasks by training linear layers. Under similar computational budgets, we show that our models are consistently better than or on par with other state-of-the-art methods. Only a single linear layer is trained on top of the features. No training-time augmentations are used. "rep." means we have reproduced the results using our evaluation framework for better comparison.
We visualize the normalized features for 10 random ImageNet classes at selected epochs of MSF training. We find that, over the course of training, semantic clusters form in the feature space.
We cluster the ImageNet dataset into 1000 clusters using k-means and show random samples from random clusters. Each row corresponds to a cluster. Note that semantically similar images are clustered together.
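For readers unfamiliar with k-means, the clustering step can be sketched with plain Lloyd iterations. This is a generic illustration on toy 2-D points, assuming the clustering is run on feature vectors. It is not the pipeline used to produce the figure, and the initialization, iteration count, and function names are assumptions.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: alternate between assigning each point to
    its nearest center and moving each center to the mean of its members."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

# Two well-separated toy groups standing in for image features.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, assign = kmeans(points, k=2)
```

Sampling a few members from a few clusters in `assign` is then the analogue of showing random images from random clusters.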