Abstract
Self-supervised learning (SSL) offers a compelling solution to the challenge of extensive labeled data requirements in traditional supervised learning. With the proven success of Vision Transformers (ViTs) in supervised tasks, there is increasing interest in adapting them for SSL frameworks. However, the high computational demands of SSL pose substantial challenges, particularly on resource-limited platforms like edge devices, despite its ability to achieve high accuracy without labeled data. Recent studies in supervised learning have shown that token pruning can reduce training costs by removing less informative tokens without compromising accuracy. However, SSL’s dual-branch encoders make traditional single-branch pruning strategies less effective, as they fail to account for the critical cross-branch similarity information, leading to reduced accuracy in SSL. To this end, we introduce SimPrune, a novel token pruning strategy designed for ViTs in SSL. SimPrune leverages cross-branch similarity information to efficiently prune tokens, retaining essential semantic information across dual branches. Additionally, we incorporate a difficulty-aware pruning strategy to further enhance SimPrune's effectiveness. Experimental results show that our proposed approach effectively reduces training computation while maintaining accuracy. Specifically, our approach offers 24% savings in training costs compared to SSL baseline, without sacrificing accuracy.
Background: Discriminative SSL
Discriminative self-supervised learning trains a model by aligning representations from two different augmented views of the same image.
A typical SSL framework uses an online branch and a target branch. The online branch is optimized by back-propagation, while the target branch is updated by Momemtum-based method or even not trained.
This dual-branch structure is the key difference from supervised learning.
Why SSL Needs Cross-Branch Pruning
We first test whether a representative single-branch, attention-based token pruning method is effective for discriminative self-supervised learning.
| Method | Training FLOPs | SSL Accuracy | SSL Drop | Supervised Drop |
|---|---|---|---|---|
| DINO baseline | 100% | 57.16 | - | - |
| Attention-based pruning, kr = 0.9 | 87% | 56.65 | -0.51 | -0.10 |
| Attention-based pruning, kr = 0.8 | 76% | 54.68 | -2.48 | -0.21 |
Reason
Supervised learning uses a single image stream and explicit labels, so token importance can be estimated within one branch. Discriminative SSL instead aligns two augmented views through online and target branches. If pruning is done independently in each branch, corresponding semantic regions can be removed unevenly, weakening the SSL alignment objective.
Design
Transformer Block Integration
SimPrune is placed inside the Transformer block after the attention module and before the MLP.
SimPrune is inserted into selected Transformer blocks, such as the 4th, 7th, and 10th blocks, and works directly with standard SSL dual-branch training.
The online and target SSL branches keep the same backbone structure, while SimPrune reduces the token sequence that later layers need to process.
Cross-Branch Similarity Pruning
-
1
Match tokens across branches
For each online-branch image token, SimPrune finds the most similar target-branch token using cosine similarity.
-
2
Sort pairs by similarity
The matched token pairs are ordered from high to low similarity, forming the basis for pruning decisions.
-
3
Prune token pairs together
Tokens carrying related semantic information are retained or removed together, avoiding asymmetric pruning between views.
Difficulty-Aware Sliding Window
Early SSL training benefits from easier training, so SimPrune initially retains token pairs with higher similarity and prunes more dissimilar pairs. As training progresses, the retained window shifts toward lower-similarity pairs, gradually increasing the task difficulty and encouraging stronger representations.
Main Results
ImageNet-1k experiments use DINO with DeiT encoders for 300 epochs. Results below highlight the keep-rate 0.8 setting from the main comparison.
| Encoder | Method | Accuracy | Training FLOPs | Training Time |
|---|---|---|---|---|
| DeiT-T | DINO | 55.71 | 100% | 100% |
| EViT | 53.13 | 76% | 88% | |
| ToMe | 53.65 | 76% | 86% | |
| SimPrune (Ours) | 55.66 | 76% | 86% | |
| DeiT-S | DINO | 62.49 | 100% | 100% |
| EViT | 60.06 | 76% | 87% | |
| ToMe | 60.34 | 76% | 87% | |
| SimPrune (Ours) | 62.31 | 76% | 85% | |
| DeiT-B | DINO | 64.56 | 100% | 100% |
| EViT | 62.27 | 76% | 89% | |
| ToMe | 61.58 | 76% | 88% | |
| SimPrune (Ours) | 64.21 | 76% | 89% |
At keep rate 0.8, SimPrune reduces FLOPs by 24% and training time by about 13% on average, while keeping accuracy close to the unpruned DINO baseline.
Beyond Image Classification
SimPrune also transfers to fine-grained datasets and dense downstream tasks after SSL pretraining.
| Method | Stanford Cars Accuracy |
FGVC Aircraft Accuracy |
FLOPs |
|---|---|---|---|
| DINO | 52.87 | 55.70 | 100% |
| EViT | 50.72 | 53.81 | 76% |
| ToMe | 49.56 | 53.15 | 76% |
| SimPrune (Ours) | 52.61 | 55.48 | 76% |
Fine-grained datasets with DeiT-S and keep rate 0.8.
| Method | MS COCO APb |
ADE20K mIoU |
FLOPs |
|---|---|---|---|
| DINO | 49.8 | 34.3 | 100% |
| EViT | 48.0 | 31.2 | 76% |
| ToMe | 46.8 | 31.7 | 76% |
| SimPrune (Ours) | 49.7 | 34.0 | 76% |
Downstream object detection and semantic segmentation with DeiT-B.
Pruning Visualization