Chen Yulin's Blog

Posted 2025-03-18 | Updated 2025-08-15 | Review | a minute read (About 197 words)

Clio: Real-time Task-Driven Open-Set 3D Scene Graphs

Contributions:

The first contribution of this paper is to propose a task-driven 3D scene understanding problem, where the robot is given a list of tasks in natural language, and has to select the granularity and the subset of objects and scene structure to retain in its map that is sufficient to complete the tasks.
The second contribution is an algorithm for task-driven 3D scene understanding based on an Agglomerative Information Bottleneck (IB) approach, which clusters 3D primitives in the environment into task-relevant objects and regions.
Building on these two contributions, the authors implement a real-time pipeline.
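To make the Agglomerative IB idea concrete, here is a minimal sketch of the classic greedy merging procedure, not Clio's actual implementation. It assumes each primitive has a prior weight and a distribution over tasks, considers all pairs rather than only spatially adjacent primitives, and stops on a hypothetical absolute threshold `max_loss`; all function and parameter names are made up for illustration.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) in nats; entries with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(w_i, p_i, w_j, p_j):
    """Task-relevant information lost by merging two clusters.

    w_i, w_j: cluster priors p(c), assumed positive.
    p_i, p_j: conditionals p(y | c) over tasks y.
    The cost is (w_i + w_j) times the weighted Jensen-Shannon divergence.
    """
    w = w_i + w_j
    a, b = w_i / w, w_j / w
    mix = [a * x + b * y for x, y in zip(p_i, p_j)]
    return w * (a * kl(p_i, mix) + b * kl(p_j, mix))


def agglomerative_ib(weights, cond, max_loss):
    """Greedily merge the cheapest pair until the next merge costs more than max_loss.

    Returns a list of clusters, each a list of primitive indices.
    """
    clusters = [[i] for i in range(len(weights))]
    weights = list(weights)
    cond = [list(p) for p in cond]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(weights[i], cond[i], weights[j], cond[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        if cost > max_loss:
            break
        # The merged cluster keeps the prior-weighted mixture of both conditionals.
        w = weights[i] + weights[j]
        cond[i] = [(weights[i] * x + weights[j] * y) / w for x, y in zip(cond[i], cond[j])]
        weights[i] = w
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j], weights[j], cond[j]
    return clusters
```

Primitives whose task distributions are nearly identical merge almost for free, while merging primitives relevant to different tasks is expensive, so the threshold controls the granularity of the resulting map.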

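The paper argues that different tasks require semantic information at different granularities, and implements this by combining SAM with [[CLIP多模态预训练模型]]. However, it ignores predicate relations and parent-child relations between objects. In essence, it still supports only basic operations such as navigation, picking up, and putting down.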

Posted 2025-02-17 | Updated 2025-08-15 | Review | 2 minutes read (About 297 words)

ConceptFusion

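## Approach

The goal is to build an open-set multimodal 3D map `M`. Modality-specific encoders (foundation models) $F_{Mode}$ can encode multimodal signals such as images, text, audio, and clicks into a vector space.

`M` consists of a set of points, each of which stores a vertex position, a normal vector, a confidence count, a color, and a concept vector.

The first step is per-frame preprocessing (a frame is a single input image): vertex and normal maps and camera poses are obtained from the sequence of input depth images, and a semantic context embedding is then computed for every pixel of every image. This embedding is obtained by combining local and global CLIP features.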

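Next comes feature fusion: the vertex and normal maps of each frame are mapped into the global coordinate frame using the camera pose. Every pixel $(u, v)_t$ in frame $X_t$ has a corresponding point $P_k$ in `M`.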
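The pixel-to-point mapping can be illustrated with a small sketch. It assumes a pinhole camera model and a camera-to-world pose given as a rotation matrix and a translation vector; neither is specified in these notes, and the function name and argument layout are hypothetical.

```python
def backproject(u, v, depth, intrinsics, rotation, translation):
    """Lift pixel (u, v) with metric depth to a 3D point in the global frame.

    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera.
    rotation: 3x3 camera-to-world rotation as nested lists.
    translation: camera position in the global frame, length 3.
    """
    fx, fy, cx, cy = intrinsics
    # Point in the camera frame.
    cam = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
    # Rigid transform into the global frame: R @ cam + t.
    return tuple(
        sum(rotation[r][c] * cam[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
```

For example, with an identity rotation, a pixel at the principal point lands directly in front of the camera at the measured depth.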

The formula for aggregating the features from different frames $X_t$ into the points of `M`:
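The formula itself is not reproduced in this post. As a rough illustration only, the sketch below assumes a confidence-weighted running average, which is a common choice in point-based fusion and fits the fact that every point stores a confidence count. The function name and the `pixel_weight` parameter are hypothetical and not taken from the paper.

```python
def fuse_feature(point_feature, point_confidence, pixel_feature, pixel_weight=1.0):
    """Blend a new per-pixel feature into the feature stored at a map point.

    Returns the updated feature and the updated confidence count.
    """
    total = point_confidence + pixel_weight
    fused = [
        (point_confidence * f + pixel_weight * g) / total
        for f, g in zip(point_feature, pixel_feature)
    ]
    return fused, total
```

Under this assumption, a point that has been observed many times changes little when one more frame arrives, while a freshly created point adopts most of the new observation.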

Posted 2025-02-15 | Updated 2025-08-15 | Review | 6 minutes read (About 919 words)

Scene-LLM

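## Intro

Although existing vision-language models (VLMs) have made great progress in 2D vision-language understanding, their limited grasp of persistent 3D spatial information usually makes them less effective than humans, who rely on 3D representations for indoor scene tasks. Some recent work such as [[3D-LLM]] bridges 3D visual information with text and other modalities and shows potential for 3D visual understanding and reasoning. However, these methods mainly handle static 3D scenes, which makes them less suited to interactive planning that involves scene changes.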

The model proposed in this paper mainly targets 3D dense captioning and interactive planning.
It combines two kinds of information:

egocentric (crucial for immediate updates during object interactions and for localizing the agent within the scene)
comprehensive scene-level (provides temporally persistent and multi-view consistent details of the entire 3D scene)

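This requires aligning the dense 3D visual information with the textual embedding space of a pre-trained LLM. 3D point sets pose a unique problem because of their continuous coordinate system and the need for a representation that adapts to changes in scene state.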

3D-VQA
VLN (Vision-and-Language Navigation)

3D-Visual-Language Data Generation

As in [[3D-LLM]], RGB-D data is captured from multiple views and then integrated into 3D frames.
The annotations come from Mini-GPT-V2 (capable of generating captions and object descriptions from images by using caption and grounded caption identifiers).

3D-frame

Uses image frames and a 2D-VLM(Mini-GPT-V2) to generate frame descriptions

Scene Data

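3D scene data is reconstructed by aggregating 3D frames according to their camera poses.
Llama-2-Chat-70B [65] is used to generate language annotations for the scenes.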

The model is prompted with a mix of context data including generated frame captions, frame object descriptions, annotated object lists, and annotated bounding boxes. These prompts lead to diverse instruction-following data types such as dense captions, object captions, task decomposition, functionality enhancement, question answering, and human-robot dialogues.

From Vision Studio: self-checking is applied to the VLM-generated content [83].

Scene-LLM

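Scene-LLM is a 3D vision-language model (VLM) with a simple yet effective architecture, designed to understand both egocentric and scene-level 3D visual information so that it can carry out interactive planning tasks. This section outlines the 3D visual feature extraction process, the model architecture, the alignment of 3D visual information with the datasets, and inference with Scene-LLM.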

Employ visual language semantic features [51] to represent 3D visual semantics

first extracting pixel-wise CLIP features from each image and then aggregating these into a 3D point set [[ConceptFusion]]

Tokenize 3D visual features for LLM input:

hybrid point-voxel representation (need for dense 3D visual information, support for interactive updates, and manageable token lengths for the LLM)
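One way to see why a voxel grid keeps token counts manageable is the sketch below. It is only an assumption about how such a representation could be built, not Scene-LLM's actual code: point features that fall into the same voxel are averaged, so the number of tokens equals the number of occupied voxels rather than the number of points. The function name and the `voxel_size` parameter are hypothetical.

```python
import math
from collections import defaultdict


def voxelize(points, features, voxel_size):
    """Average the features of all points that fall into the same voxel.

    points: iterable of (x, y, z); features: one feature list per point.
    Returns a dict mapping an integer voxel index to the mean feature.
    """
    buckets = defaultdict(list)
    for (x, y, z), feat in zip(points, features):
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append(feat)
    return {
        key: [sum(col) / len(feats) for col in zip(*feats)]
        for key, feats in buckets.items()
    }
```

Because each voxel is keyed by its grid index, a scene change only requires recomputing the voxels touched by the new observation, which is in line with the need for interactive updates mentioned above.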

The network is broadly divided into two parts:

Projection layer

To bridge 3D visual tokens(F) with the LLM’s tokenized space
FC(1030, 768)->GELU->FC(768,768)

LLM

Llama-2-7b as the foundational LLM backbone
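The projection layer `FC(1030, 768)->GELU->FC(768,768)` is a two-layer MLP applied to each 3D visual token. The dependency-free sketch below only illustrates the shape of the computation; the weights are random, the initialization scheme and class names are assumptions, and a real model would use a deep-learning framework instead.

```python
import math
import random


def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


class Linear:
    """Fully connected layer y = W x + b with randomly initialised weights."""

    def __init__(self, in_dim, out_dim, rng):
        scale = 1.0 / math.sqrt(in_dim)  # illustrative init, not from the paper
        self.weight = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]


class ProjectionLayer:
    """FC(in_dim, hidden_dim) -> GELU -> FC(hidden_dim, out_dim) on one token."""

    def __init__(self, in_dim=1030, hidden_dim=768, out_dim=768, seed=0):
        rng = random.Random(seed)
        self.fc1 = Linear(in_dim, hidden_dim, rng)
        self.fc2 = Linear(hidden_dim, out_dim, rng)

    def __call__(self, token):
        return self.fc2([gelu(h) for h in self.fc1(token)])
```

The default sizes match the dimensions quoted above; passing smaller values such as `ProjectionLayer(in_dim=4, hidden_dim=3, out_dim=2)` is convenient for quick experiments.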

Training

Stage 1: Pretraining for Feature Alignment

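3D frame data is used under two coordinate systems (camera and world coordinates) so that Scene-LLM understands both egocentric and scene-centric views.
In this stage only the projection layer is trained, which efficiently aligns the 3D visual features with the text features while keeping the LLM parameters (φ) unchanged.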

Stage 2: Finetuning

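Scene-LLM is then optimized to respond accurately to user instructions. The 3D frame-language and 3D scene-language data are merged into the preamble using the identifier token "I see". Each text description is split into an instruction ($T_{INST}$) and its corresponding response ($T_{ANS}$). Given the transformed 3D visual tokens ($T_{3D}$) and the instruction tokens ($T_{INST}$), the goal is to fine-tune the LLM (φ) to generate $T_{ANS}$ autoregressively.
Here the projection layer and the LLM are fine-tuned jointly, with the trainable parameters denoted by θ = {ψ, φ}.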
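The notes do not state the loss explicitly. Assuming the standard next-token negative log-likelihood used for autoregressive language models, the Stage 2 objective can be sketched as follows, where $T_{ANS,i}$ is the $i$-th response token and $T_{ANS,<i}$ are the response tokens before it:

$$
\mathcal{L}(\theta) = -\sum_{i=1}^{|T_{ANS}|} \log p_{\theta}\left(T_{ANS,i} \mid T_{3D},\, T_{INST},\, T_{ANS,<i}\right), \qquad \theta = \{\psi, \varphi\}
$$

Under the same assumption, Stage 1 would minimize a loss of the same form while updating only the projection-layer parameters ψ and keeping φ frozen.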

Posted 2025-02-13 | Updated 2025-08-15 | Review | 3 minutes read (About 505 words)

3D-LLM

Intro

Recent works have explored aligning images and videos with LLM for a new generation of multi-modal LLMs that equip LLMs with the ability to understand and reason about 2D images.
However, models that can analyze the 3D physical world are still lacking. Such analysis involves richer concepts such as spatial relationships, affordances, physics, and interaction.

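This motivates injecting the 3D world into large language models: the paper introduces a new family of 3D-LLMs that take 3D representations (i.e., 3D point clouds with features) as input and perform a range of 3D-related tasks.
Advantages: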

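Long-term memory of the entire scene can be stored in a holistic 3D representation rather than in episodic partial-view observations.
3D properties such as affordances and spatial relationships can be reasoned about from the 3D representation, far beyond the reach of language-only or 2D-image-based LLMs.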

Challenges

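Data acquisition: the scarcity of 3D data hinders the development of 3D-based foundation models, and 3D data paired with language descriptions is even harder to obtain.
- The paper proposes a set of unique data-generation pipelines that can produce large-scale 3D data paired with language.
Obtaining meaningful 3D features that can align with language features for 3D-LLMs: one approach is to train a 3D encoder from scratch with a contrastive learning paradigm similar to the one used to align 2D images and language. However, this paradigm consumes enormous amounts of data, time, and GPU resources.
- The paper uses a 3D feature extractor that constructs 3D features from the 2D pretrained features of rendered multi-view images. Many recent vision-language models (e.g., BLIP-2, Flamingo) also use 2D pretrained CLIP features to train their VLMs. Because the extracted 3D features lie in the same feature space as the 2D pretrained features, 2D VLMs can be used seamlessly as backbones and fed the 3D features, which makes training 3D-LLMs efficient.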