Chen Yulin's Blog

Posted 2025-04-16 | Updated 2025-07-20 | Note | a few seconds read (About 3 words)

Vision-Language Interpreter for Robot Task Planning

Posted 2025-04-16 | Updated 2025-07-20 | Review | a few seconds read (About 3 words)

RoboEXP

Posted 2025-04-14 | Updated 2025-07-20 | Note | a few seconds read (About 95 words)

(Mindmap) Part-level Scene Understanding for Robots

Concept Overview

Scene Graph

A scene graph is a structural representation that captures detailed semantics by explicitly modeling:

objects ("man", "fire hydrant", "shorts")
attributes of objects ("fire hydrant is yellow")
relations between paired objects ("man jumping over fire hydrant")

A scene graph is a set of visual relationship triplets in the form of <subject, relation, object> or <object, is, attribute>

Scene graphs should serve as an **objective semantic representation** of the state of the scene

Posted 2025-03-25 | Updated 2025-07-20 | Note | 5 minutes read (About 724 words)

(Roadmap) Deeper Scene Graph For Robots

Target Problem (Task Scenario)

Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment.
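For example, in household scenarios:

OVMM Challenge: https://aihabitat.org/challenge/2023_homerobot_ovmm/
Executing general, long-horizon, embodied tasks in such complex scenes requires generating a sequence of discrete actions, each of which can accumulate and propagate errors. The robot therefore has to create a feasible plan and recover when that plan fails, which requires an effective abstraction of the physical environment and a planner that can fully exploit that abstraction. Meeting these challenges requires integrating natural language understanding, multi-granularity scene abstraction and understanding, and resilient reasoning.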

Coarse-grained (object-level) scene abstraction (scene graph construction) has already been studied extensively; see the Reconstruct-Anything Literature Review. Those works focus on object detection and object-level visual relationship detection.

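The part to focus on is multi-granularity scene abstraction.
Why multiple granularities are needed:

Scalability: with only a single granularity, the number of scene graph tokens fed to the LLM cannot be controlled, which hurts scalability.
Interacting with objects in more complex ways than grasping requires knowing each part's position, its semantic properties, and its parent-child relationship with the parent object. Scene graph generation therefore has to consider finer granularity.
Objects of different complexity need different levels of granularity.
Different tasks also need different object granularity.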
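Concrete examples (the granularity levels a task requires):
<Task> Fill the kettle with water:
- <object-level> kettle
  - <part-level> lid
  - <part-level> handle
- <object-level> water dispenser
  - <part-level> control panel
    - <part-level> green button (room-temperature water)
    - <part-level> red button (boiling water)
    - <part-level> child lock
  - <part-level> water tray
- <object-level> table
  - <part-level> tabletop
<Task> Leave the room:
- <object-level> door
  - <part-level> handle
  - <part-level> note: "Put the toy back in the red basket before leaving the room"
- <object-level> yellow duck toy
- <object-level> red basket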
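The task examples above can be read as a tree in which each task only needs the scene graph down to a certain depth. The following is a minimal sketch of that idea, not the author's implementation: the `Node` class, the `serialize` function, and the `max_depth` parameter are all hypothetical names introduced for illustration. It shows how cutting the tree at a chosen depth keeps the text passed to an LLM bounded.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical multi-granularity scene graph."""
    name: str
    level: str  # "object" or "part"
    children: list["Node"] = field(default_factory=list)


def serialize(node: Node, max_depth: int, depth: int = 0) -> list[str]:
    """Flatten a subtree into indented lines, stopping below max_depth."""
    lines = [f"{'  ' * depth}- <{node.level}-level> {node.name}"]
    if depth < max_depth:
        for child in node.children:
            lines.extend(serialize(child, max_depth, depth + 1))
    return lines


# The water dispenser from the kettle-filling example.
dispenser = Node("water dispenser", "object", [
    Node("control panel", "part", [
        Node("green button (room-temperature water)", "part"),
        Node("red button (boiling water)", "part"),
        Node("child lock", "part"),
    ]),
    Node("water tray", "part"),
])

coarse = serialize(dispenser, max_depth=0)  # object level only
fine = serialize(dispenser, max_depth=2)    # down to individual buttons
print("\n".join(fine))
```

A navigation-only task could use `coarse`, while pressing the correct button needs `fine`; the depth is chosen per task and per object rather than fixed globally.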

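In finer-grained (part-level) scene abstraction, the focus is on recognizing the relationship between child objects and their parent objects.

In addition, the counterpart of object detection in an object-level scene graph is, for a part-level scene graph, the multi-granularity segmentation of child objects and the extraction of their semantic information. This can be implemented with the existing Semantic-SAM plus a semantic feature extractor such as CLIP or another multimodal model.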

Main Research Workflow

Defining the Research Object: Parent-child Relationship

What aspects does parent-child relationship include?

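Semantic composition: how the presence or absence of the child object changes the semantics of the parent object (translation in embedding space).
Kinematic relations: the object needs to be built as a kinematic tree.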

Project Workflow

Self-supervised Feature Extraction Methods

Posted 2025-03-18 | Updated 2025-07-20 | Review | a few seconds read (About 3 words)

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Posted 2025-03-12 | Updated 2025-07-20 | Review | 10 minutes read (About 1524 words)

Reconstruct Anything Literature Review

Papers covered:

Related work
- [[Part-level Scene Reconstruction Affords Robot Interaction]]
- [[Scene Reconstruction with Functional Objects for Robot Autonomy]]
- [[Reasoning with Scene Graphs for Robot Planning under Partial Observability]]
- [[ACDC- Automated Creation of Digital Cousins for Robust Policy Learning]]
- [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]]
- [[Factorizable Net= An Efficient Subgraph-based Framework for Scene Graph Generation]]
- [[SayPlan= Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning]]
- [[ConceptGraphs= Open-Vocabulary 3D Scene Graphs for Perception and Planning]]
Data generation
- [[PHYSCENE- Physically Interactable 3D Scene Synthesis for Embodied AI]]
Datasets
- CLEVR
- Visual Genome
Others
- [[SceneGraphFusion- Incremental 3D Scene Graph Prediction from RGB-D Sequences]]
- [[Visual Relationship Detection with Language Priors]]
- [[Image generation from scene graphs]]
- [[(FCSGG) Fully Convolutional Scene Graph Generation]]
- [[RelTR= Relation Transformer for Scene Graph Generation]]
- [[Scene Graph Generation by Iterative Message Passing]]
- [[From Pixels to Graphs= Open-Vocabulary Scene Graph Generation with Vision-Language Models]]
- [[(VtransE) Visual Translation Embedding Network for Visual Relation Detection]]
- [[(UVtransE) Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation]]
- [[(RLSV) Representation Learning for Scene Graph Completion via Jointly Structural and Visual Embedding]]
- [[Energy-Based Learning for Scene Graph Generation]]

Research Goal

By building a part-level scene graph and combining it with LLM reasoning, enable robots to perform more complex interactions and thereby complete more complex tasks.

Scene Graph Introduction

Background

Visual scene understanding has long been regarded as the holy grail of computer vision.

Rapid scene understanding at all levels

Generally

Visual scene understanding can be divided into two groups of tasks:

recognition task
- image level
  - image classification
    - [[DINO]]
    - [[CLIP多模态预训练模型]]
- pixel level
  - semantic segmentation: classify each pixel in an image into a category
    - Mask RCNN
    - U-Net
- instance level
  - instance segmentation: detect and delineate each individual object instance in an image (bounding boxes or segmentation masks)
    - [[Grounding-DINO]]
    - [[Grounded-SAM]]
- pixel & instance level
  - [[Panoptic Segmentation]]: takes into account both per-pixel class and instance labels
    - [[MaskDINO]]
    - [[Semantic-SAM]]
application task
- …

Relation & Interaction

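However, the general-purpose works above all focus on the localization of objects. Higher-level tasks emphasize exploring the rich semantic relationships between objects, as well as the interaction between objects and their surroundings.

Visual relationship detection (VRD)
- [[GPS-Net= Graph Property Sensing Network for Scene Graph Generation]]
- [[Large-scale visual relationship understanding]]
Human-object interaction (HOI)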
- …

CV & NLP

Beyond this, there is also the direction that combines NLP and CV, mainly through VLMs:

image caption
visual question answering
visual dialog

Structured Representation of Scene (Scene Graph)

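Perceiving the overall scene and representing that information effectively remains a bottleneck.
Fei-Fei Li and colleagues therefore proposed the scene graph in [[Image Retrieval using Scene Graphs]].

The counterpart of a structured representation is a latent representation.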

Scene Graph Definition

A scene graph is a structural representation that captures detailed semantics by explicitly modeling:

objects ("man", "fire hydrant", "shorts")
attributes of objects ("fire hydrant is yellow")
relations between paired objects ("man jumping over fire hydrant")

A scene graph is a set of visual relationship triplets in the form of <subject, relation, object> or <object, is, attribute>

Scene graphs should serve as an objective semantic representation of the state of the scene
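As a small illustration of the triplet definition above, the sketch below stores a scene graph as a list of triplets and separates attribute triplets (`<object, is, attribute>`) from relation triplets (`<subject, relation, object>`). The function name `build_scene_graph` and the "man wearing shorts" triplet are assumptions added for the example; they do not come from any specific paper.

```python
from collections import defaultdict

# Triplets in the two forms named in the definition.
triplets = [
    ("man", "jumping over", "fire hydrant"),  # <subject, relation, object>
    ("man", "wearing", "shorts"),             # illustrative relation, assumed
    ("fire hydrant", "is", "yellow"),         # <object, is, attribute>
]


def build_scene_graph(triplets):
    """Split triplets into object nodes, per-object attributes, and relation edges."""
    objects = set()
    attributes = defaultdict(list)
    relations = []
    for subj, rel, obj in triplets:
        objects.add(subj)
        if rel == "is":
            # The third slot is an attribute value, not another object node.
            attributes[subj].append(obj)
        else:
            objects.add(obj)
            relations.append((subj, rel, obj))
    return {
        "objects": sorted(objects),
        "attributes": dict(attributes),
        "relations": relations,
    }


graph = build_scene_graph(triplets)
print(graph)
```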

Why Scene Graphs

Scene graphs have the inherent potential to handle and improve other vision tasks.
Vision tasks they can help with include:

Image captioning
- take an image as an input and parse it into a scene graph, and then generate a reasonable text as output.
Visual question answering
Content-based image retrieval
Image generation
- extracting scene graphs from the text description and then generate realistic images
  - [[Image generation from scene graphs]]
referring expression comprehension

Scene Graph Generation

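The goal of scene graph generation is to parse an image or a sequence of images and produce a structured representation, bridging the gap between visual and semantic perception and ultimately reaching a complete understanding of the visual scene.
The essence of the task is detecting visual relationships.

Pioneering Work

Fei-Fei Li's group proposed an early method for visual relationship detection in [[Visual Relationship Detection with Language Priors]],
as well as Visual Genome, a dataset that includes object relationships.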

Generation Methods

Two-stage

Detects objects first and then solves a classification task to determine the relationship between each pair of objects

**General:** a) Obtain subject/object and union box proposals (regions of interest, ROI) from the image.

b) Extract features for each region. For objects these include appearance, spatial information, label, depth, and mask; for predicates they include appearance, spatial, depth, and mask.

Fast/Faster R-CNN

c) These multimodal features are vectorized, combined, and refined, which can be done through:

message passing mechanisms
- [[Scene Graph Generation by Iterative Message Passing]]
attention mechanisms
visual translation embedding

d) A classifier predicts the predicate category.
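Steps a) to d) can be sketched end to end on toy data. This is an illustration under strong assumptions, not any published model: detections are given as hand-written boxes, the only feature is a normalized center offset inside the union box, and `toy_predicate_classifier` is a hypothetical rule-based stand-in for a learned classifier. Image coordinates are assumed to have y growing downward.

```python
from itertools import permutations

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def union_box(a: Box, b: Box) -> Box:
    """Smallest box that encloses both the subject and the object box."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def spatial_feature(subj: Box, obj: Box) -> tuple[float, float]:
    """Offset from subject center to object center, normalized by the union box."""
    u = union_box(subj, obj)
    w = max(u[2] - u[0], 1e-6)  # guard against degenerate boxes
    h = max(u[3] - u[1], 1e-6)
    sx, sy = (subj[0] + subj[2]) / 2, (subj[1] + subj[3]) / 2
    ox, oy = (obj[0] + obj[2]) / 2, (obj[1] + obj[3]) / 2
    return ((ox - sx) / w, (oy - sy) / h)


def toy_predicate_classifier(feat: tuple[float, float]) -> str:
    """Hypothetical stand-in for step d): pick a predicate from the offset."""
    dx, dy = feat
    if abs(dy) > abs(dx):
        return "above" if dy > 0 else "below"
    return "left of" if dx > 0 else "right of"


def two_stage_sgg(detections: dict[str, Box]) -> list[tuple[str, str, str]]:
    """Stage 1 output (detections) in, one triplet per ordered object pair out."""
    triplets = []
    for (s, s_box), (o, o_box) in permutations(detections.items(), 2):
        triplets.append((s, toy_predicate_classifier(spatial_feature(s_box, o_box)), o))
    return triplets


detections = {"cup": (40, 10, 60, 30), "table": (0, 30, 100, 80)}
result = two_stage_sgg(detections)
print(result)
```

Because every ordered pair is classified, the number of candidate relations grows quadratically with the number of detections, which is one reason two-stage methods are comparatively expensive.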

Based on Visual Translation Embedding

Translation between Subject and Object (subject+predicate ≈ object)
- [[(VtransE) Visual Translation Embedding Network for Visual Relation Detection]]
Translation among Subject, Object and Predicate
- [[(UVtransE) Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation]]
- [[(RLSV) Representation Learning for Scene Graph Completion via Jointly Structural and Visual Embedding]]
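The relation subject + predicate ≈ object can be turned into a scoring rule: the best predicate is the one whose vector, added to the subject embedding, lands closest to the object embedding. The sketch below assumes tiny hand-written 2-D vectors in place of learned embeddings, and `rank_predicates` is a hypothetical helper name; the actual models learn these embeddings and projections from data.

```python
import math


def rank_predicates(subj_vec, obj_vec, predicate_vecs):
    """Sort predicate names by the distance ||subject + predicate - object||."""
    def distance(name):
        translated = [s + p for s, p in zip(subj_vec, predicate_vecs[name])]
        return math.dist(translated, obj_vec)
    return sorted(predicate_vecs, key=distance)


# Toy embeddings, chosen by hand for illustration only.
subject = [1.0, 0.0]
obj = [1.0, 2.0]
predicates = {
    "on": [0.0, 2.0],       # subject + "on" lands exactly on the object
    "next to": [2.0, 0.0],  # subject + "next to" lands far from it
}

ranking = rank_predicates(subject, obj, predicates)
print(ranking)
```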

One-stage

Simultaneously detects and recognizes objects and relations
Compared with two-stage:

Requires fewer computational resources and parameters
Is not affected by the quality of object detection
Examples:
[[(FCSGG) Fully Convolutional Scene Graph Generation]] (bottom-up + RAF)
[[RelTR= Relation Transformer for Scene Graph Generation]] (bottom-up)
[[SGTR= End-to-end Scene Graph Generation with Transformer]] (top-down)

Open-Vocabulary

These are mostly built on large models such as LLMs or VLMs.

[[From Pixels to Graphs= Open-Vocabulary Scene Graph Generation with Vision-Language Models]]
[[ConceptGraphs= Open-Vocabulary 3D Scene Graphs for Perception and Planning]]

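Scene Graph Summary

All of the works here are about determining the predicate relationship between two independent objects (e.g., riding, holding…); none of them addresses part-level relationships. Part-level parent-child relationships are very different from object-level predicate relationships.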

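Scene Understanding Methods Not Based on Scene Graphs

Implicit Scenes

The scene information is stored in a neural network with no explicit structure, and the planner (which can be an LLM) obtains information by querying this model.

[[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]]
- Unstructured; only provides semantic querying and localization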

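Digital Cousin Scenes

The core idea is to assemble an interactable substitute scene from models that support richer interaction.

[[Scene Reconstruction with Functional Objects for Robot Autonomy]]
[[ACDC- Automated Creation of Digital Cousins for Robust Policy Learning]]
- Fits objects in the scene using an existing library of detailed models. This enables richer interaction and works zero-shot on common household items, but precision is limited and it cannot handle complex objects.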

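Contact Graph (can be seen as an extension of the scene graph)

Mainly used to model kinematic relations between objects

[[Scene Reconstruction with Functional Objects for Robot Autonomy]]
[[Part-level Scene Reconstruction Affords Robot Interaction]]
- This one covers the kinematic relations between parent and child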

Scene Graph & Robots

Using scene graphs for robot task understanding and planning

[[SayPlan= Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning]]
[[Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation]]
[[ConceptGraphs= Open-Vocabulary 3D Scene Graphs for Perception and Planning]]

Posted 2025-02-15 | Updated 2025-07-20 | Review | 6 minutes read (About 919 words)

Scene-LLM

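## Intro

Although existing vision-language models (VLMs) have made great progress in 2D vision-language understanding, their limited grasp of persistent 3D spatial information often makes them less effective on indoor scene tasks than approaches that use 3D representations. Some recent papers such as [[3D-LLM]] bridge 3D visual information with text and other modalities, showing potential for 3D visual understanding and reasoning. However, they mainly handle static 3D scenes, which makes them less suited to interactive planning that involves scene changes.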

The proposed model mainly targets 3D dense captioning and interactive planning.
It combines two kinds of information:

egocentric (crucial for immediate updates during object interactions and for localizing the agent within the scene)
comprehensive scene-level (provides temporally persistent and multi-view consistent details of the entire 3D scene)

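This requires aligning the dense 3D visual information with the textual embedding space of a pre-trained LLM. 3D point sets pose a unique problem because of their continuous coordinate system and the need for a representation that adapts to changes in scene state.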

3D-VQA
VLN(Visual-Language Navigation)

3D-Visual-Language Data Generation

As in [[3D-LLM]], RGB-D information is collected from multiple views and then integrated into 3D frames.
The annotations come from Mini-GPT-V2 (capable of generating captions and object descriptions from images by using caption and grounded caption identifiers).

3D-frame

Uses image frames and a 2D-VLM(Mini-GPT-V2) to generate frame descriptions

Scene Data

3D scene data is reconstructed by aggregating 3D frames according to their camera poses.
Llama-2-Chat-70B [65] is used to generate language annotations for the scene.

It is prompted with a mix of context data including generated frame captions, frame object descriptions, annotated object lists, and annotated bounding boxes. These prompts lead to diverse instruction-following data types such as dense caption, object caption, task decomposition, functionality enhancement, question-answering, and human-robot dialogues.

From Vision Studio: self-checking applied to VLM-generated content [83]

Scene-LLM

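Scene-LLM is a 3D vision-language model (VLM) with a simple yet effective architecture, designed to understand both egocentric and scene-level 3D visual information so that it can successfully perform interactive planning tasks. This section outlines the 3D visual feature extraction process, the model architecture, the alignment of 3D visual information with the dataset, and inference with Scene-LLM.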

Employ visual language semantic features [51] to represent 3D visual semantics

first extracting pixel-wise CLIP features from each image and then aggregating these into a 3D point set [[ConceptFusion]]

Tokenize 3D visual features for LLM input:

hybrid point-voxel representation (need for dense 3D visual information, support for interactive updates, and manageable token lengths for the LLM)

The network broadly consists of two parts:

Projection layer

To bridge 3D visual tokens(F) with the LLM’s tokenized space
FC(1030, 768)->GELU->FC(768,768)
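The layer stack FC(1030, 768) -> GELU -> FC(768, 768) can be written out as a short NumPy sketch. Only the layer sizes come from the note; the class name `ProjectionLayer`, the random initialization, and the tanh approximation of GELU are assumptions made so the example runs on its own, and no trained weights are implied.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """tanh approximation of GELU (an assumption; the exact variant is not stated)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class ProjectionLayer:
    """FC(in_dim, hidden_dim) -> GELU -> FC(hidden_dim, out_dim)."""

    def __init__(self, in_dim: int = 1030, hidden_dim: int = 768,
                 out_dim: int = 768, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        # Small random weights as placeholders for learned parameters.
        self.w1 = rng.normal(0.0, 0.02, size=(in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, tokens: np.ndarray) -> np.ndarray:
        """Map (num_tokens, in_dim) 3D visual tokens to (num_tokens, out_dim)."""
        return gelu(tokens @ self.w1 + self.b1) @ self.w2 + self.b2


# Five made-up 3D visual tokens with the 1030-dimensional size from the note.
tokens = np.random.default_rng(1).normal(size=(5, 1030))
projected = ProjectionLayer()(tokens)
print(projected.shape)
```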

LLM

Llama-2-7b as the foundational LLM backbone

Training

Stage 1: Pretraining for Feature Alignment

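3D frame data is used under two coordinate systems (camera and world coordinates) to ensure that Scene-LLM understands both the egocentric and the scene-centric perspective.
In this stage only the projection layer is trained, which efficiently aligns 3D visual features with text features while keeping the LLM parameters (φ) unchanged.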

Stage 2: Finetuning

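Scene-LLM is optimized to respond accurately to user instructions. Both 3D frame-language and 3D scene-language data are used, with the identifier token "I saw" added as a prefix. Each text description is split into an instruction ($T_{INST}$) and its corresponding response ($T_{ANS}$). Given the transformed 3D visual tokens ($T_{3D}$) and the instruction tokens ($T_{INST}$), the goal is to fine-tune the LLM (φ) to generate $T_{ANS}$ autoregressively.
Here the projection layer and the LLM are fine-tuned jointly, denoted by $\theta = \{\psi, \phi\}$.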
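Read as a standard autoregressive language-modeling objective, Stage 2 can be written as follows. The exact loss form is an assumption here, since the note only states the inputs, the target, and the trainable parameters; $T_{ANS}^{(i)}$ denotes the $i$-th response token and $T_{ANS}^{(<i)}$ the tokens before it.

$$
\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{|T_{ANS}|} \log p_{\theta}\left(T_{ANS}^{(i)} \mid T_{3D},\, T_{INST},\, T_{ANS}^{(<i)}\right), \qquad \theta = \{\psi, \phi\}
$$

Under the same reading, Stage 1 would optimize only the projection parameters $\psi$ while $\phi$ stays frozen.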

Posted 2025-02-13 | Updated 2025-07-20 | Review | 3 minutes read (About 505 words)

3D-LLM

Intro

Recent works have explored aligning images and videos with LLM for a new generation of multi-modal LLMs that equip LLMs with the ability to understand and reason about 2D images.
However, models that can analyze 3D physical space are still missing. This involves richer concepts such as spatial relationships, affordances, physics, and interaction.

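This motivates injecting the 3D world into large language models. The paper introduces a new family of 3D-LLMs that take 3D representations (i.e., 3D point clouds with features) as input and perform a range of 3D-related tasks.
Advantages:

Long-term memory of the entire scene can be stored in a holistic 3D representation instead of episodic partial-view observations.
3D properties such as affordances and spatial relationships can be inferred from the 3D representation, far beyond the scope of language-based or 2D image-based LLMs.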

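Challenges

Data acquisition: the scarcity of 3D data hinders the development of 3D-based foundation models, and 3D data paired with language descriptions is even harder to obtain.
- The paper proposes a set of unique data generation pipelines that can produce large-scale 3D data paired with language.
Obtaining meaningful 3D features that can align with language features for 3D-LLMs: one option is to train a 3D encoder from scratch with a contrastive paradigm similar to the one used to align 2D images and language. However, this paradigm consumes enormous amounts of data, time, and GPU resources.
- The paper uses a 3D feature extractor that constructs 3D features from the 2D pretrained features of rendered multi-view images. Recently, many vision-language models (e.g., BLIP-2, Flamingo) have also used 2D pretrained CLIP features to train their VLMs. Because the extracted 3D features are mapped to the same feature space as the 2D pretrained features, 2D VLMs can be used seamlessly as backbones and fed the 3D features, enabling efficient training of 3D-LLMs.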

TODO

Posted 2025-01-06 | Updated 2025-07-20 | Note | 6 minutes read (About 959 words)

OK-Robot- What Really Matters in Integrating Open-Knowledge Models for Robotics

Intro

Creating a general-purpose robot has been a longstanding dream of the robotics community.

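Background

Current systems that pursue this goal are brittle, closed, and fail when they encounter unseen situations. Even the largest robot models can usually be deployed only in previously seen environments [5, 6]. This brittleness is further exacerbated in settings where robot data is scarce, such as unstructured home environments.

Large vision models have shown capabilities in semantic understanding, detection, and connecting visual representations to language, and at the same time basic robot skills such as navigation, grasping, and rearrangement are fairly mature.
Yet robot systems that combine modern vision models with robot-specific primitives perform very poorly.

This may be because naively chaining together several uncertain systems causes accuracy to deteriorate sharply.
What is needed is a careful framework that combines VLMs with robot primitives (navigation, grasping, placing): OK-Robot.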
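The compounding effect is easy to see with a little arithmetic. The sketch below assumes that stages fail independently and uses made-up per-stage success rates (they are not measurements from the paper); `pipeline_success` is a hypothetical helper name.

```python
from math import prod


def pipeline_success(stage_success_rates) -> float:
    """Success probability of a chain of stages, assuming independent failures."""
    return prod(stage_success_rates)


# Hypothetical per-stage success rates for illustration only.
stages = {"navigate": 0.9, "pick": 0.9, "place": 0.9}
overall = pipeline_success(stages.values())
print(f"overall success: {overall:.3f}")
```

Even when every stage succeeds nine times out of ten, three chained stages already drop below 75% under this assumption, which is why the way components are combined matters as much as the components themselves.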

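Findings

Pre-trained VLMs are highly effective for open-vocabulary navigation: current open-vocabulary vision-language models such as CLIP or OWL-ViT offer strong performance in identifying arbitrary objects in the real world and enable zero-shot navigation to them.
Pre-trained grasping models can be applied directly to mobile manipulation: similar to VLMs, special-purpose robot models pre-trained on large amounts of data can be applied out of the box to open-vocabulary grasping in homes. These robot models require no additional training or fine-tuning.
How components are combined is crucial: given the pre-trained models, they can be combined with a simple state-machine model, with no training. Using heuristics to counteract the robot's physical limitations also leads to higher success rates in the real world.
Some challenges remain: although OK-Robot improves on prior work given the great difficulty of zero-shot operation in arbitrary homes, an analysis of the failure modes shows that there is significant room for improvement in the VLMs, the robot models, and the robot morphology, which would directly raise the performance of open-knowledge manipulation agents.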

Methodology

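The main task the framework performs

"Pick up A (from B) and drop it on/in C", where A is an object and B and C are places in a real-world environment such as homes

Responsible for spatial reconstruction, coarse object localization, and robot navigation.
Methods used: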

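CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]]: an RGB-D video of the home -> a sequence of posed (with camera pose and positions) RGB-D images, used to reconstruct the environment. From this, the work also obtains the floor surface next to objects and receptacles in the environment.
OWL-ViT [[Simple Open-Vocabulary Object Detection with Vision Transformers]]: the detector is applied to every frame, and each object's bounding box, CLIP embedding, and detector confidence are extracted and passed to the object memory module.
SAM: used to convert the OWL-ViT detection boxes into masks.
VoxelMap: similar to the object-centric memory of CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]]. Based on the CLIP semantic vector of every point in the point cloud, each 5 cm voxel holds a detector-confidence-weighted average of CLIP embeddings.
Querying the memory module: the language query is first converted into a CLIP semantic vector; the voxel whose CLIP embedding is semantically closest in the VoxelMap is then found and used for localization.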
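The VoxelMap and its query step can be sketched in plain Python. This is an illustration under stated assumptions rather than the OK-Robot code: the function names (`voxel_key`, `build_voxel_map`, `query`) are hypothetical, the 2-D vectors stand in for real CLIP embeddings, and "semantically closest" is taken to mean highest cosine similarity. Only the 5 cm voxel size and the confidence-weighted averaging come from the note.

```python
import math
from collections import defaultdict

VOXEL_SIZE = 0.05  # 5 cm voxels, as described in the note


def voxel_key(point):
    """Integer voxel index of a 3D point given in meters."""
    return tuple(math.floor(c / VOXEL_SIZE) for c in point)


def build_voxel_map(points):
    """points: iterable of (xyz, embedding, detector_confidence) tuples."""
    sums = {}
    weights = defaultdict(float)
    for xyz, emb, conf in points:
        key = voxel_key(xyz)
        acc = sums.setdefault(key, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += conf * value
        weights[key] += conf
    # Confidence-weighted average embedding per voxel.
    return {k: [v / weights[k] for v in acc] for k, acc in sums.items()}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def query(voxel_map, query_emb):
    """Return the voxel whose embedding is most similar to the query embedding."""
    return max(voxel_map, key=lambda k: cosine(voxel_map[k], query_emb))


# Toy points: two fall into the same voxel, one lies elsewhere.
points = [
    ((0.01, 0.02, 0.00), [1.0, 0.0], 0.9),
    ((0.03, 0.01, 0.02), [0.8, 0.2], 0.3),
    ((1.02, 1.01, 0.52), [0.0, 1.0], 0.8),
]
voxel_map = build_voxel_map(points)
best = query(voxel_map, [0.1, 0.9])  # stand-in for an encoded language query
print(best, voxel_map[best])
```

Weighting by detector confidence means a low-confidence detection contributes little to the voxel's stored embedding, which limits the influence of spurious detections on later queries.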

Experiment

Posted 2024-12-17 | Updated 2025-07-20 | Review | a few seconds read (About 42 words)

Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation

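This is very close to my own idea, and its level of completion is quite high too. Its implementation approach and the papers it cites are worth referencing.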

## Intro
