Momentum Contrast for Unsupervised Visual Representation Learning

Vision Transformers Need Registers

An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale

DINO

CLIP

LERF: Language Embedded Radiance Fields

Some Thoughts Regarding "Reconstruct Anything"

CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory