Chen Yulin's Blog

Posted 2025-02-16 | Updated 2026-06-22 | Review | a minute read (About 216 words)

By combining [[DINO]] with grounded pre-training, it can detect arbitrary objects from human input such as category names or referring expressions.
Open-Vocab. Det

Grounding DINO is an open-set object detector that can detect any object with respect to an arbitrary free-form text prompt. The model was trained on over 10 million images, including detection data, visual grounding data, and image-text pairs. It has strong zero-shot detection performance. However, the model needs text as input and can only detect boxes with corresponding phrases.

Grounding-DINO

Principle

Tight modality fusion based on [[DINO]]

What is feature fusion?

- In the multimodal field, feature fusion refers specifically to techniques that fuse features from different modalities (vision, text, audio, and so on). CLIP should be regarded as a form of middle fusion: fusion and alignment happen right after feature extraction.

#### Large-scale grounded pre-training for concept generalization

Reformulating **object detection** as a **phrase grounding task** and introducing **contrastive training** between object regions and language phrases on large-scale data.
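To make the phrase grounding view concrete, here is a minimal NumPy sketch in which each region is labeled by its most similar phrase instead of by a fixed class index. The function name `ground_regions`, the toy 3-d features, and the `threshold` value are all hypothetical; this is an illustration of the idea, not Grounding DINO's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def ground_regions(region_feats, phrase_feats, phrases, threshold=0.5):
    """Label each region with its best-matching phrase, or None below threshold."""
    # Cosine similarity between every region and every phrase: [R, P]
    sim = l2_normalize(region_feats) @ l2_normalize(phrase_feats).T
    best = sim.argmax(axis=1)
    return [phrases[j] if sim[i, j] >= threshold else None
            for i, j in enumerate(best)]

# Toy 3-d features: hypothetical values for illustration only.
phrases = ["a cat", "a red cup"]
phrase_feats = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
region_feats = np.array([[0.9, 0.1, 0.0],   # close to "a cat"
                         [0.1, 0.8, 0.1],   # close to "a red cup"
                         [0.0, 0.0, 1.0]])  # matches neither phrase
labels = ground_regions(region_feats, phrase_feats, phrases)
```

Because the labels come from text features rather than a fixed classifier head, changing the prompt changes what can be detected without retraining.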

Posted 2025-02-16 | Updated 2026-06-22 | Review | a few seconds read (About 17 words)

Grounded-SAM

https://github.com/IDEA-Research/Grounded-Segment-Anything

By [[Grounding-DINO]] + SAM
Achieving Open-Vocab. Det & Seg

Posted 2025-01-09 | Updated 2026-06-22 | Note | 5 minutes read (About 722 words)

Momentum Contrast for Unsupervised Visual Representation Learning

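The left side is the query encoder; the right side is the key encoder.

## What it is

MoCo learns image features through unsupervised contrastive learning (loss: InfoNCE).

The pretext task is instance discrimination.

Pseudocode: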

# f_q, f_k: encoder networks for query and key 
# queue: dictionary as a queue of K keys (CxK) 
# m: momentum 
# t: temperature  

f_k.params = f_q.params # initialize 
for x in loader: # load a minibatch x with N samples 
	x_q = aug(x) # a randomly augmented version 
	x_k = aug(x) # another randomly augmented version  
	q = f_q.forward(x_q) # queries: NxC 
	k = f_k.forward(x_k) # keys: NxC 
	k = k.detach() # no gradient to keys  
	
	# positive logits: Nx1 
	l_pos = bmm(q.view(N,1,C), k.view(N,C,1))  # dot product of each positive pair in the batch (cosine similarity when q and k are L2-normalized)
	
	# negative logits: NxK 
	l_neg = mm(q.view(N,C), queue.view(C,K))  
	
	# logits: Nx(1+K) 
	logits = cat([l_pos, l_neg], dim=1)  
	
	# contrastive loss, Eqn.(1) 
	labels = zeros(N) # positives are the 0-th; treating the positive as class 0 lets CrossEntropyLoss be used directly
	loss = CrossEntropyLoss(logits/t, labels)  
	
	# SGD update: query network 
	loss.backward() 
	update(f_q.params)  
	
	# momentum update: key network 
	f_k.params = m*f_k.params+(1-m)*f_q.params  
	
	# update dictionary 
	enqueue(queue, k) # enqueue the current minibatch 
	dequeue(queue) # dequeue the earliest minibatch
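The loss part of the pseudocode above can be run as plain NumPy. This is only a sketch: the encoders are replaced by random features, the temperature `t=0.07` and the sizes `N`, `C`, `K` are assumed values, and no gradients or updates are computed.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def info_nce_loss(q, k, queue, t=0.07):
    """InfoNCE with one positive and K queued negatives per query.

    q, k: [N, C] L2-normalized queries and keys; queue: [C, K] negatives.
    """
    l_pos = np.sum(q * k, axis=1, keepdims=True)         # [N, 1]
    l_neg = q @ queue                                     # [N, K]
    logits = np.concatenate([l_pos, l_neg], axis=1) / t   # [N, 1+K]
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()                         # positive is class 0

rng = np.random.default_rng(0)
N, C, K = 4, 8, 16
q = l2_normalize(rng.normal(size=(N, C)))
queue = l2_normalize(rng.normal(size=(C, K)), axis=0)

# A key identical to its query gives the largest possible positive logit.
loss_matched = info_nce_loss(q, q.copy(), queue)
loss_random = info_nce_loss(q, l2_normalize(rng.normal(size=(N, C))), queue)
```

With the negatives held fixed, the loss falls as the positive pair becomes more similar, which is why `loss_matched` is lower than `loss_random`.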

Highlights

Dictionary as a queue

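The key encoder (momentum encoder) produces the negative samples, and the encoded negatives are stored in a FIFO queue so that later comparisons can use them directly. Each training step uses a new mini-batch: its samples are encoded and added to the queue, and the oldest mini-batch is removed. FIFO is a sensible choice because the oldest mini-batch was encoded by the most outdated encoder. This keeps the number of negatives, the K in the formula, under control.

This reduces the cost of computing the dictionary.
The mini-batch size is also decoupled from the number of negatives.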
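A minimal sketch of such a FIFO dictionary, written as a ring buffer. The class name `KeyQueue` and the toy sizes are hypothetical, and the sketch assumes K is a multiple of the mini-batch size so a batch never wraps around the end of the buffer.

```python
import numpy as np

class KeyQueue:
    """Fixed-size FIFO dictionary of keys, stored as a ring buffer."""

    def __init__(self, dim, size):
        self.keys = np.zeros((size, dim))
        self.ptr = 0        # index of the oldest entries
        self.size = size

    def enqueue_dequeue(self, batch_keys):
        """Overwrite the oldest slots with the newest mini-batch of keys."""
        n = batch_keys.shape[0]
        assert self.size % n == 0, "sketch assumes K is a multiple of the batch size"
        self.keys[self.ptr:self.ptr + n] = batch_keys
        self.ptr = (self.ptr + n) % self.size

queue = KeyQueue(dim=2, size=4)          # K = 4 negatives
for step in range(3):                    # three mini-batches of 2 keys each
    queue.enqueue_dequeue(np.full((2, 2), float(step)))
```

After the third step the keys from step 0 have been overwritten by those from step 2, while the keys from step 1 are still in the queue. The queue length K is set independently of the batch size.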

Momentum update

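Because the number of negatives (the dictionary/queue) is large, gradients cannot be back-propagated to the key encoder. One option is to copy the query encoder's parameters directly to the key encoder, but a key encoder that changes too quickly makes the features in the dictionary inconsistent, so a momentum update is used instead.

> The larger the queue dictionary, the larger m should be in theory, to keep the keys in the dictionary consistent.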
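The effect of m can be seen in a toy example with a single scalar "parameter". The helper name `momentum_update` and the values used here are illustrative assumptions; a real encoder would apply the same rule to every weight tensor.

```python
def momentum_update(key_params, query_params, m=0.999):
    """Move each key parameter a fraction (1 - m) toward its query counterpart."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

key_params = [0.0]       # toy one-parameter key encoder
query_params = [1.0]     # pretend the query encoder has moved to 1.0
for _ in range(10):
    key_params = momentum_update(key_params, query_params, m=0.9)
```

After n updates toward a fixed target the key parameter has covered a fraction 1 - m^n of the gap, so a larger m makes the key encoder change more slowly and keeps keys encoded at different steps more consistent.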

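Comparison with prior work

a) All samples come from the same mini-batch and the two encoders are identical, so both receive gradients and the keys are highly consistent, but the dictionary size is limited by the mini-batch size.

b) Only one encoder is trained. The memory bank stores the keys of all samples. After each gradient step, the keys that were sampled in that step are re-encoded with the updated encoder and written back to the memory bank.

Lacks feature consistency
A full epoch of training is needed to refresh the whole memory bank once

MoCo is closer to the memory bank approach, but uses a queue dictionary and a momentum update.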

Posted 2025-01-09 | Updated 2026-06-22 | Note | a few seconds read (About 0 words)

Vision Transformers Need Registers

Posted 2025-01-09 | Updated 2026-06-22 | Note | a few seconds read (About 0 words)

DINOv2- Learning Robust Visual Features without Supervision

Posted 2025-01-09 | Updated 2026-06-22 | Note | a few seconds read (About 71 words)

AN IMAGE IS WORTH 16X16 WORDS- TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

https://www.youtube.com/watch?v=j3VNqtJUoz0&t=16s

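Core idea:

Split the image into patches, apply a linear projection, then add position embeddings before feeding the sequence to the Transformer encoder.
An extra cls token is used as a placeholder (the ViT output is the output token that corresponds to this cls input token).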
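The input pipeline described above can be sketched in NumPy. The function name `vit_input_tokens` and all sizes are assumptions for illustration; in a real ViT the projection, cls token, and position embeddings are learned parameters rather than random or zero arrays.

```python
import numpy as np

def vit_input_tokens(image, patch, w_proj, cls_token, pos_embed):
    """Turn an [H, W, C] image into the [1 + num_patches, D] encoder input."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # [gh, patch, gw, patch, c] -> [gh, gw, patch, patch, c] -> flat patches
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch * patch * c)
    tokens = patches @ w_proj                              # linear projection to D
    tokens = np.concatenate([cls_token, tokens], axis=0)   # prepend the cls token
    return tokens + pos_embed                              # add position embeddings

rng = np.random.default_rng(0)
H, W, C = 32, 32, 3      # toy image size
P, D = 16, 8             # patch size and embedding width
num_patches = (H // P) * (W // P)
tokens = vit_input_tokens(
    rng.normal(size=(H, W, C)),
    P,
    w_proj=rng.normal(size=(P * P * C, D)),
    cls_token=np.zeros((1, D)),
    pos_embed=rng.normal(size=(1 + num_patches, D)),
)
```

A 32x32 image with 16x16 patches gives 4 patch tokens plus the cls token, so the encoder sees a sequence of 5 tokens. After the encoder, the token at index 0 is the one read out as the image representation.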

Posted 2025-01-08 | Updated 2026-06-22 | Note | 4 minutes read (About 561 words)

DINO

https://github.com/facebookresearch/dino/tree/main

# Emerging Properties in Self-Supervised Vision Transformers

https://juejin.cn/post/7224738994825789496
https://www.youtube.com/watch?v=h3ij3F3cPIk&t=1005s
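DI + NO (distillation + no labels)
Specifically, DINO uses a method called "unsupervised self-distillation", which learns the model's representations through self-supervised learning. The model uses its own outputs to generate "pseudo-labels" and then trains on those pseudo-labels again, further improving its performance and generalization.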
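A rough NumPy sketch of this pseudo-label loss follows. It draws on the DINO paper rather than on this note: the teacher output is centered and sharpened with a low temperature, and the student is trained to match it with cross-entropy. The temperatures `t_s=0.1` and `t_t=0.04`, the toy shapes, and the batch mean used as the center are assumptions.

```python
import numpy as np

def softmax(x, temp):
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """Cross-entropy from the teacher's pseudo-label distribution to the student."""
    p_teacher = softmax(teacher_out - center, t_t)     # centered, sharpened target
    log_p_student = np.log(softmax(student_out, t_s))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

rng = np.random.default_rng(0)
teacher_out = rng.normal(size=(4, 16))     # teacher outputs for 4 images
center = teacher_out.mean(axis=0)          # stand-in for the running center

# A student whose softmax reproduces the teacher's distribution exactly.
matched_student = (teacher_out - center) * (0.1 / 0.04)
loss_matched = self_distillation_loss(matched_student, teacher_out, center)
loss_random = self_distillation_loss(rng.normal(size=(4, 16)), teacher_out, center)
```

Cross-entropy against a fixed target is lowest when the student's distribution equals the teacher's, so `loss_matched` is a lower bound for any other student output. In training, no gradient flows through the teacher.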

Knowledge distillation

https://blog.csdn.net/xbinworld/article/details/83063726

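The key idea is to train with soft targets alongside hard targets, where the soft targets come from the predictions of a large model. One might ask: if the true label (hard target) is entirely correct, why use soft targets at all?
A hard target carries very little information (entropy), whereas a soft target carries much more, including the relationships between classes. For example, when classifying donkeys and horses, even if an image shows a horse, the soft target does not put 1 at the horse index and 0 everywhere else as the hard target does; it also assigns some probability to donkey. [5]
The benefit is that the soft target tells us the image looks more like a donkey than like a car or a dog. This soft information lives in the probabilities, as do the degrees of similarity between labels. But if the soft target looks like [0.98 0.01 0.01], it is of little use, so a temperature parameter T is added to the softmax (this is not needed at inference time once training is finished).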
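A small pure-Python sketch shows how the temperature softens the distribution. The logits and the class names in the comment are hypothetical values chosen for illustration.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; a larger T gives a softer distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 1.0, 0.5]    # hypothetical teacher logits: horse, donkey, car
sharp = softmax_with_temperature(logits, T=1.0)
soft = softmax_with_temperature(logits, T=4.0)
```

At T=1 almost all of the probability sits on the first class. At T=4 the ranking is unchanged, but the donkey and car classes receive visibly more probability, so the relationship between classes becomes usable as a training signal.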

ViT

DINO

Overall, the task DINO suits best is grouping the same object in different states together.

On the emergent properties in DINO
https://juejin.cn/post/7280436457142501388

Work before DINO

We have also seen emerged two properties that can be leveraged in future applications: the quality of the features in k-NN classification has a potential for image retrieval. The presence of information about the scene layout in the features can also benefit weakly supervised image segmentation.

Posted 2025-01-06 | Updated 2026-06-22 | Note | a minute read (About 197 words)

CLIP

https://blog.csdn.net/h661975/article/details/135116957

loss: ITC (Image Text Contrastive)

# image_encoder - ResNet or Vision Transformer 
# text_encoder - CBOW or Text Transformer 
# I[n, h, w, c] - minibatch of aligned images 
# T[n, l] - minibatch of aligned texts 
# W_i[d_i, d_e] - learned proj of image to embed 
# W_t[d_t, d_e] - learned proj of text to embed 
# t - learned temperature parameter  

# extract feature representations of each modality 
I_f = image_encoder(I) #[n, d_i] 
T_f = text_encoder(T) #[n, d_t]  

# joint multimodal embedding [n, d_e] 
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n] 
logits = np.dot(I_e, T_e.T) * np.exp(t)  

# symmetric loss function 
labels = np.arange(n) 
loss_i = cross_entropy_loss(logits, labels, axis=0) 
loss_t = cross_entropy_loss(logits, labels, axis=1) 
loss = (loss_i + loss_t)/2
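The pseudocode above leaves `l2_normalize` and `cross_entropy_loss` undefined. A runnable NumPy sketch, with those helpers filled in as assumptions and the encoders replaced by random features, could look like this. The initial temperature `t = log(1 / 0.07)` and the toy shapes are also assumed values.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def cross_entropy(logits, labels):
    """Mean cross-entropy, treating each row of logits as one prediction."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

def clip_loss(I_f, T_f, W_i, W_t, t):
    I_e = l2_normalize(I_f @ W_i)                      # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)                      # [n, d_e]
    logits = I_e @ T_e.T * np.exp(t)                   # [n, n] scaled cosine sims
    labels = np.arange(logits.shape[0])                # matches on the diagonal
    loss_img_to_txt = cross_entropy(logits, labels)    # pick the text per image
    loss_txt_to_img = cross_entropy(logits.T, labels)  # pick the image per text
    return (loss_img_to_txt + loss_txt_to_img) / 2

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 6))      # stand-in for encoder outputs
W = rng.normal(size=(6, 3))          # shared projection, for the toy case only
t = np.log(1 / 0.07)
loss_aligned = clip_loss(feats, feats, W, W, t)          # pair i matches pair i
loss_shuffled = clip_loss(feats, feats[::-1], W, W, t)   # texts in reversed order
```

When every image is paired with its own text the diagonal holds the largest logits and the loss is low. Reversing the text order moves the true matches off the diagonal, so `loss_shuffled` is higher.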

Cross_entropy_loss:

CLIP is essentially a global image embedding, which makes it poorly suited to extracting pixel-aligned features.

Posted 2025-01-06 | Updated 2026-06-22 | Note | 5 minutes read (About 790 words)

LERF- Language Embedded Radiance Fields

NeRF+CLIP

Intro

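Background

Neural Radiance Fields (NeRF) have emerged as a powerful technique for capturing photorealistic digital representations of intricate real-world 3D scenes. However, the immediate output of a NeRF is nothing but a colorful density field, devoid of meaning or context, which inhibits building interfaces for interacting with the resulting 3D scenes.
Natural language is an intuitive interface for interacting with a 3D scene. Consider a capture of a kitchen. Imagine being able to navigate this kitchen by asking where the "utensils" are, or more specifically for a tool that could be used for "stirring", or even for your favorite mug with a specific logo on it, all through the comfort and familiarity of everyday conversation. This requires not only the ability to handle natural-language input queries, but also the ability to incorporate semantics at multiple scales and relate to long-tail and abstract concepts.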

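Solution

A language field
LERF grounds language in NeRF by optimizing embeddings from an off-the-shelf vision-language model (such as CLIP) into the 3D scene.
LERF provides an added benefit: because CLIP embeddings are extracted from multiple views at multiple scales, the relevancy maps of text queries obtained through the 3D CLIP embedding are more localized than those obtained through 2D CLIP embeddings. They are also 3D-consistent by definition and can be queried directly in the 3D field without rendering to multiple views.

Compared with CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]], LERF is denser.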

CLIP-Fields [32] and NLMaps-SayCan [8] fuse CLIP embeddings of crops into pointclouds, using a contrastively supervised field and classical pointcloud fusion respectively. In CLIP-Fields, the crop locations are guided by Detic [40]. On the other hand, NLMaps-SayCan relies on region proposal networks. These maps are sparser than LERF as they primarily query CLIP on detected objects rather than densely throughout views of the scene. Concurrent work ConceptFusion [19] fuses CLIP features more densely in RGBD pointclouds, using Mask2Former [9] to predict regions of interest, meaning it can lose objects which are out of distribution to Mask2Former’s training set. In contrast, LERF does not use region or mask proposals.