Chen Yulin's Blog

Posted 2025-02-19 | Updated 2025-10-16 | Review | 2 minutes read (About 273 words)

GLIP is a model that learns object-level, language-aware, and semantic-rich visual representations.
It unifies object detection and phrase grounding for pre-training.

Key questions

What is phrase grounding?
Phrase Grounding refers to the task of associating or “grounding” a natural language phrase (like a sentence or a word) to a specific region or object in an image. In other words, it’s about finding which part of the image corresponds to the object or concept described by a given text phrase.

For instance, if you have the phrase “the red ball on the table” and an image of a room with a red ball placed on a table, the goal of phrase grounding is to identify the exact region in the image that corresponds to the “red ball on the table”, distinguishing it from other objects in the image.

## Grounded Language Image Pre-training

GLIP recasts the classic object detection task as a grounding problem and proposes a **Unified Formulation**.

Unified Formulation

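Traditional object detectors classify each region into one of c classes, whereas this paper treats object detection as phrase grounding.
Detection is reformulated as a grounding task by grounding/aligning each region to the c phrases (one per class) in a text prompt, e.g.
the classification prompt “person. bicycle. car. … . toothbrush”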
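
The reformulation can be sketched in a few lines. This is only an illustration under stated assumptions: `build_prompt` and `alignment_scores` are hypothetical helper names, the random arrays stand in for real region and phrase encoders, and a plain dot product is assumed as the region-to-phrase score.

```python
import numpy as np

def build_prompt(class_names):
    """Join class names into a 'a. b. c' prompt and record each name's character span."""
    spans, cursor = [], 0
    for name in class_names:
        spans.append((cursor, cursor + len(name)))
        cursor += len(name) + len(". ")
    return ". ".join(class_names), spans

def alignment_scores(region_feats, phrase_feats):
    """Region-to-phrase alignment logits: one row per region, one column per phrase."""
    return region_feats @ phrase_feats.T

classes = ["person", "bicycle", "car"]
prompt, spans = build_prompt(classes)

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))   # 4 hypothetical region features, dim 8
phrases = rng.normal(size=(3, 8))   # 3 hypothetical phrase features, dim 8
scores = alignment_scores(regions, phrases)
predicted = scores.argmax(axis=1)   # best-matching phrase index for each region
```

Classification over a fixed label set becomes picking the best-aligned phrase, so adding a class only means adding a phrase to the prompt.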

Posted 2025-02-18 | Updated 2025-10-16 | Review | a few seconds read (About 0 words)

Extract Free Dense Labels from CLIP

Posted 2025-02-17 | Updated 2025-10-16 | Review | 2 minutes read (About 297 words)

ConceptFusion

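## Approach

The goal is to build an open-set multimodal 3D map `M`. Modality-specific encoders (foundation models) $F_{Mode}$ can encode multimodal signals such as images, text, audio, and clicks into a vector space.

`M` consists of a set of points, each of which contains a vertex position, a normal vector, a confidence count, a color, and a concept vector.

The first step is frame preprocessing (a frame is a single input image): vertex and normal maps and camera poses are obtained from the sequence of input depth images, and a semantic context embedding is then computed for every pixel of every image. This embedding is obtained by combining local and global CLIP features.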

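The second step is feature fusion: using the camera pose, the vertex and normal maps of each frame are mapped into the global coordinate system. Each pixel $(u, v)_t$ in frame $X_t$ has a corresponding point $P_k$ in `M`.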

Formula for aggregating the features from different frames $X_t$ into the feature points of `M`:
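
The formula itself is missing from these notes. A common choice for this kind of per-point fusion is a confidence-weighted running average, and the sketch below assumes that form; `fuse_feature` and `alpha` are hypothetical names, and this is not necessarily the exact rule used in the paper.

```python
import numpy as np

def fuse_feature(point_feat, point_conf, pixel_feat, alpha=1.0):
    """Confidence-weighted running average of a map point's feature.

    point_feat: current feature of the map point, shape (D,)
    point_conf: accumulated confidence count of the point (scalar)
    pixel_feat: feature of the pixel that projects onto the point, shape (D,)
    alpha:      assumed weight of the new observation
    """
    fused = (point_conf * point_feat + alpha * pixel_feat) / (point_conf + alpha)
    return fused, point_conf + alpha

# Fuse two observations of the same point from two different frames.
feat, conf = np.zeros(4), 0.0
for obs in (np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0, 0.0])):
    feat, conf = fuse_feature(feat, conf, obs)
```

With equal weights, the point ends up with the mean of the observed features, and the confidence count grows with every frame that sees it.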

Posted 2025-02-16 | Updated 2025-10-16 | Review | a minute read (About 216 words)

Grounding-DINO

By combining [[DINO]] with grounded pre-training, it can detect arbitrary objects from human inputs such as category names or referring expressions.
Open-Vocab. Det

an open-set object detector that can detect any objects with respect to an arbitrary free-form text prompt. The model was trained on over 10 million images, including detection data, visual grounding data, and image-text pairs. It has a strong zero-shot detection performance. However, the model needs text as inputs and can only detect boxes with corresponding phrases.

Grounding-DINO

Principle

Tight modality fusion based on [[DINO]]

What is feature fusion?

- In the multimodal field, feature fusion refers specifically to techniques that fuse features from different modalities (e.g. visual, text, audio). CLIP should be regarded as a form of middle fusion, where fusion and alignment happen right after feature extraction.

#### large-scale grounded pre-train for concept generalization

Reformulating **object detection** as a **phrase grounding task** and introducing **contrastive training** between object regions and language phrases on large-scale data

Posted 2025-02-16 | Updated 2025-10-16 | Review | a few seconds read (About 17 words)

Grounded-SAM

https://github.com/IDEA-Research/Grounded-Segment-Anything

By [[Grounding-DINO]] + SAM
Achieving Open-Vocab. Det & Seg

Posted 2025-02-15 | Updated 2025-10-16 | Review | 6 minutes read (About 919 words)

Scene-LLM

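## Intro

Although existing vision-language models (VLMs) have made great progress in 2D vision-language understanding, their limited grasp of persistent 3D spatial information generally makes them less effective than humans, who rely on 3D representations for indoor scene tasks. Some recent work such as [[3D-LLM]] bridges 3D visual information with text and other modalities and shows potential for 3D visual understanding and reasoning. However, these methods mainly handle static 3D scenes, which makes them less adaptable to interactive planning that involves scene changes.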

The proposed model mainly targets 3D dense captioning and interactive planning.
It combines two kinds of information:

egocentric (crucial for immediate updates during object interactions and for localizing the agent within the scene)
comprehensive scene-level (provides temporally persistent and multi-view consistent details of the entire 3D scene)

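This requires aligning the dense 3D visual information with the textual embedding space of a pre-trained LLM. 3D point sets pose a unique problem because of their continuous coordinate system and the need for a representation that adapts to changes in scene state.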

3D-VQA
VLN (Vision-and-Language Navigation)

3D-Visual-Language Data Generation

As in [[3D-LLM]], RGB-D information is collected from multiple views and then integrated into 3D frames.
The annotations come from Mini-GPT-V2 (capable of generating captions and object descriptions from images by using caption and grounded caption identifiers).

3D-frame

Uses image frames and a 2D-VLM(Mini-GPT-V2) to generate frame descriptions

Scene Data

3D scene data is reconstructed by aggregating 3D frames according to their camera poses.
Llama-2-Chat-70B [65] is used to generate the language annotations for each scene.

It is prompted with a mix of context data including generated frame captions, frame object descriptions, annotated object lists, and annotated bounding boxes. These prompts lead to diverse instruction-following data types like dense caption, object caption, task decomposition, functionality enhancement, question-answering, and human-robot dialogues.

From Vision Studio. Self-checking is applied to the VLM-generated content: [83]

Scene-LLM

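Scene-LLM is a 3D vision-language model (VLM) with a simple yet effective architecture, designed to understand both egocentric and scene-level 3D visual information so that it can successfully perform interactive planning tasks. This section outlines the 3D visual feature extraction process, the model architecture, the alignment of 3D visual information with the dataset, and inference with Scene-LLM.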

Employ visual language semantic features [51] to represent 3D visual semantics

first extracting pixel-wise CLIP features from each image and then aggregating these into a 3D point set [[ConceptFusion]]

Tokenize 3D visual features for LLM input:

hybrid point-voxel representation (need for dense 3D visual information, support for interactive updates, and manageable token lengths for the LLM)
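
These notes do not say how the hybrid representation is built. One simple way to keep the token count manageable while preserving dense point features is to average the point features that fall into each voxel; the sketch below assumes that approach, and `voxel_pool` and `voxel_size` are hypothetical names.

```python
import numpy as np

def voxel_pool(points, feats, voxel_size):
    """Average point features inside each occupied voxel.

    points: (N, 3) xyz coordinates, feats: (N, D) per-point features.
    Returns voxel centers (V, 3) and pooled features (V, D).
    """
    idx = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                 # flat index of each point's voxel
    sums = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(sums, inverse, feats)               # accumulate features per voxel
    counts = np.bincount(inverse, minlength=len(uniq))[:, None]
    centers = (uniq + 0.5) * voxel_size
    return centers, sums / counts

points = np.array([[0.1, 0.1, 0.1], [0.2, 0.3, 0.4], [1.5, 0.1, 0.1]])
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
centers, pooled = voxel_pool(points, feats, voxel_size=1.0)
```

Each occupied voxel then yields one token, and a scene change only requires re-pooling the voxels whose points moved.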

The network broadly consists of two parts:

Projection layer

To bridge 3D visual tokens(F) with the LLM’s tokenized space
FC(1030, 768)->GELU->FC(768,768)
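
The layer sizes above can be turned into a minimal NumPy sketch. The weights are random placeholders, the tanh approximation of GELU is an assumption (the notes do not say which variant is used), and `Projection` is a hypothetical class name.

```python
import numpy as np

def gelu(x):
    """GELU activation, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class Projection:
    """FC(1030, 768) -> GELU -> FC(768, 768) with randomly initialized weights."""

    def __init__(self, d_in=1030, d_hidden=768, d_out=768, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.02, size=(d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.normal(scale=0.02, size=(d_hidden, d_out))
        self.b2 = np.zeros(d_out)

    def __call__(self, tokens):
        # tokens: (num_tokens, d_in) 3D visual tokens
        return gelu(tokens @ self.w1 + self.b1) @ self.w2 + self.b2

tokens_3d = np.random.default_rng(1).normal(size=(5, 1030))  # 5 hypothetical 3D tokens
projected = Projection()(tokens_3d)                          # shape (5, 768)
```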

LLM

Llama-2-7b as the foundational LLM backbone

Training

Stage 1: Pretraining for Feature Alignment

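3D frame data is used under two coordinate systems (camera and world coordinates) so that Scene-LLM understands both egocentric and scene-centric perspectives.
In this stage only the projection layer is trained, which efficiently aligns the 3D visual features with text features while keeping the LLM parameters (φ) frozen.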

Stage 2: Finetuning

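Scene-LLM is optimized to respond accurately to user instructions. Both 3D frame-language and 3D scene-language data are used, with the identifier token "I saw" added to the preamble. Each text description is split into an instruction ($T_{INST}$) and its corresponding response ($T_{ANS}$). Given the transformed 3D visual tokens ($T_{3D}$) and the instruction tokens ($T_{INST}$), the goal is to fine-tune the LLM (φ) to generate $T_{ANS}$ autoregressively.
Here the projection layer and the LLM are fine-tuned jointly, denoted by θ = {ψ, φ}.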

Posted 2025-02-13 | Updated 2025-10-16 | Review | 3 minutes read (About 505 words)

3D-LLM

Intro

Recent works have explored aligning images and videos with LLM for a new generation of multi-modal LLMs that equip LLMs with the ability to understand and reason about 2D images.
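However, models that can analyze the 3D physical world are still lacking, which involves richer concepts such as spatial relationships, affordances, physics, interaction, and so on.

This motivates injecting the 3D world into large language models: the paper introduces a new family of 3D-LLMs that take 3D representations (i.e. 3D point clouds with features) as input and perform a range of 3D-related tasks.
Advantages:

Long-term memory of the whole scene can be stored in a holistic 3D representation rather than in episodic partial-view observations.
3D properties such as affordances and spatial relationships can be inferred from the 3D representation, far beyond the scope of language-based or 2D-image-based LLMs.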

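Challenges

Data acquisition: the scarcity of 3D data hinders the development of 3D-based foundation models, and 3D data paired with language descriptions is even harder to obtain.
- The paper proposes a set of unique data generation pipelines that can produce large-scale 3D data paired with language.
Obtain meaningful 3D features that could align with language features for 3D-LLMs: one approach is to train a 3D encoder from scratch with a contrastive paradigm similar to the one used to align 2D images and language. However, this paradigm consumes huge amounts of data, time, and GPU resources.
- The paper uses a 3D feature extractor that constructs 3D features from the 2D pre-trained features of rendered multi-view images. Many recent vision-language models (e.g. BLIP-2, Flamingo) also train their VLMs on 2D pre-trained CLIP features. Because the extracted 3D features live in the same feature space as the 2D pre-trained features, a 2D VLM can be used seamlessly as the backbone and fed the 3D features, which makes training 3D-LLMs efficient.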

TODO

Posted 2025-02-13 | Updated 2025-10-16 | Review | a few seconds read (About 0 words)

PointLLM

Posted 2025-02-13 | Updated 2025-10-16 | Review | a few seconds read (About 0 words)

ProgPrompt

Posted 2025-01-09 | Updated 2025-10-16 | Note | 5 minutes read (About 722 words)

Momentum Contrast for Unsupervised Visual Representation Learning

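The left side is the query encoder; the right side is the key encoder.

## What it is

MoCo learns image features through unsupervised contrastive learning (loss: InfoNCE).

The pretext task is instance discrimination.

Pseudocode: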
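
For reference, the InfoNCE loss for a query $q$ with one positive key $k_+$ and $K$ negative keys can be written as follows, where $\tau$ is the temperature (the sum runs over the positive and the $K$ negatives):

$$
\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}
$$

This is a $(K+1)$-way softmax classification in which the positive key is the correct class.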

# f_q, f_k: encoder networks for query and key 
# queue: dictionary as a queue of K keys (CxK) 
# m: momentum 
# t: temperature  

f_k.params = f_q.params # initialize 
for x in loader: # load a minibatch x with N samples 
	x_q = aug(x) # a randomly augmented version 
	x_k = aug(x) # another randomly augmented version  
	q = f_q.forward(x_q) # queries: NxC 
	k = f_k.forward(x_k) # keys: NxC 
	k = k.detach() # no gradient to keys  
	
	# positive logits: Nx1 
	l_pos = bmm(q.view(N,1,C), k.view(N,C,1))  # dot-product (cosine) similarity of each positive pair in the batch
	
	# negative logits: NxK 
	l_neg = mm(q.view(N,C), queue.view(C,K))  
	
	# logits: Nx(1+K) 
	logits = cat([l_pos, l_neg], dim=1)  
	
	# contrastive loss, Eqn.(1) 
	labels = zeros(N) # positives are the 0-th: the target class is index 0, so CrossEntropyLoss can be used directly
	loss = CrossEntropyLoss(logits/t, labels)  
	
	# SGD update: query network 
	loss.backward() 
	update(f_q.params)  
	
	# momentum update: key network 
	f_k.params = m*f_k.params+(1-m)*f_q.params  
	
	# update dictionary 
	enqueue(queue, k) # enqueue the current minibatch 
	dequeue(queue) # dequeue the earliest minibatch
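
The loss part of the pseudocode can be checked with a small runnable NumPy sketch. It covers only the logits and the cross-entropy with label 0; the encoders, augmentation, and optimizer are left out, and `info_nce` is a hypothetical helper name.

```python
import numpy as np

def info_nce(q, k, queue, t=0.07):
    """InfoNCE loss for a batch.

    q, k:  (N, C) query and key features (assumed L2-normalized)
    queue: (C, K) negative keys
    t:     temperature
    """
    l_pos = np.sum(q * k, axis=1, keepdims=True)         # positive logits, (N, 1)
    l_neg = q @ queue                                    # negative logits, (N, K)
    logits = np.concatenate([l_pos, l_neg], axis=1) / t  # (N, 1+K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()                        # the positive is class 0

q = np.array([[1.0, 0.0]])
queue = np.array([[0.0], [1.0]])                         # one negative, orthogonal to q
loss_match = info_nce(q, np.array([[1.0, 0.0]]), queue, t=1.0)
loss_mismatch = info_nce(q, np.array([[0.0, 1.0]]), queue, t=1.0)
```

The loss is lower when the key matches the query than when it does not, which is the signal that drives the query encoder.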

Highlights

Dictionary as a queue

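Negative samples are produced by the key encoder (momentum encoder), and the encoded negatives are stored in a FIFO queue so that later comparisons can use them directly. Every training step uses a new mini-batch: its samples are encoded and enqueued, and the samples of the oldest mini-batch are removed. The oldest mini-batch was encoded by the most outdated encoder, so FIFO is a natural choice. This keeps the number of negatives, i.e. K in the formula, under control.

It saves the computation cost of building the dictionary.
The mini-batch size is decoupled from the number of negatives.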
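
A minimal sketch of the queue behaviour, using only the standard library. `KeyQueue` and `max_batches` are hypothetical names, and strings stand in for encoded key vectors.

```python
from collections import deque

class KeyQueue:
    """FIFO dictionary of encoded keys, stored one mini-batch at a time."""

    def __init__(self, max_batches):
        # A deque with maxlen drops the oldest mini-batch automatically.
        self.batches = deque(maxlen=max_batches)

    def enqueue(self, keys):
        self.batches.append(list(keys))

    def negatives(self):
        # Flatten all stored mini-batches into one list of negative keys.
        return [key for batch in self.batches for key in batch]

queue = KeyQueue(max_batches=2)
for batch in (["a1", "a2"], ["b1", "b2"], ["c1", "c2"]):
    queue.enqueue(batch)
current_negatives = queue.negatives()
```

The number of negatives is `max_batches` times the batch size, so the dictionary can be made much larger than a single mini-batch.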

Momentum update

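Because the number of negatives (the dictionary/queue) is large, gradients cannot be back-propagated to the key encoder. One option is to copy the query encoder's parameters directly to the key encoder, but a key encoder that changes too quickly makes the features in the dictionary inconsistent, so a momentum update is used instead.

> In theory, the larger the queue dictionary, the larger m needs to be to keep the keys in the dictionary consistent.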
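
The momentum update from the pseudocode, `f_k.params = m*f_k.params+(1-m)*f_q.params`, can be sketched on plain parameter dictionaries. `momentum_update` is a hypothetical helper name, and the small `m` in the usage lines is chosen only to make the effect visible.

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter slightly toward the query encoder's value."""
    return {
        name: m * key_params[name] + (1.0 - m) * query_params[name]
        for name in key_params
    }

key_params = {"w": np.array([0.0])}
query_params = {"w": np.array([1.0])}
key_params = momentum_update(key_params, query_params, m=0.9)   # w -> 0.1
key_params = momentum_update(key_params, query_params, m=0.9)   # w -> 0.19
```

With m close to 1 the key encoder drifts slowly, so keys encoded many steps apart stay comparable.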

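Comparison with prior work

a) All samples are in one mini-batch and the two encoders are identical, so both receive gradients and the keys are highly consistent, but the dictionary size is limited.

b)
Only one encoder is trained. The memory bank stores the keys of all samples. After each gradient step, the keys in the memory bank that were sampled in that step are updated with the new encoder.

Lacks feature consistency
A whole epoch of training is needed to refresh the memory bank once

MoCo is closer to the memory bank approach, but it uses a queue dictionary and a momentum update.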