Chen Yulin's Blog

Posted 2025-04-16 | Updated 2026-06-22 | Review | a few seconds read (About 3 words)

RoboEXP

Posted 2025-03-19 | Updated 2026-06-22 | Review | a few seconds read (About 42 words)

ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

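An LLM is used to infer the spatial relations between objects, and the scene graph is built from those relations.

It still only captures object-level spatial relations, so it cannot support part-level manipulation.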

Posted 2025-03-18 | Updated 2026-06-22 | Review | 2 minutes read (About 355 words)

SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning

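The core idea is captured in the paper's pseudocode: only part of the scene graph (which is strictly hierarchical) is expanded at any time, which keeps the scene graph fed to the LLM at a manageable size.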

A scalable approach to ground LLM-based task planners across environments spanning multiple rooms and floors

The scene graph is represented with networkx (a Python package).
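A minimal sketch of how a hierarchical scene graph could be held in networkx and shown to the LLM only partially. The node names, the `level` attribute, and the `collapse` and `expand` helpers are assumptions made for illustration, not SayPlan's actual code.

```python
import networkx as nx

# Hypothetical hierarchy: floor -> room -> asset -> object.
G = nx.DiGraph()
G.add_node("floor_1", level="floor")
for room in ("kitchen", "office"):
    G.add_node(room, level="room")
    G.add_edge("floor_1", room)
G.add_node("fridge", level="asset")
G.add_edge("kitchen", "fridge")
G.add_node("apple", level="object")
G.add_edge("fridge", "apple")
G.add_node("desk", level="asset")
G.add_edge("office", "desk")


def collapse(graph, keep_levels=("floor", "room")):
    """Return the subgraph that only contains nodes at the given levels."""
    kept = [n for n, d in graph.nodes(data=True) if d["level"] in keep_levels]
    return graph.subgraph(kept).copy()


def expand(graph, visible, node):
    """Add the direct children of `node` to the visible subgraph."""
    nodes = set(visible.nodes) | set(graph.successors(node))
    return graph.subgraph(nodes).copy()


visible = collapse(G)
print(sorted(visible.nodes))  # ['floor_1', 'kitchen', 'office']
visible = expand(G, visible, "kitchen")
print(sorted(visible.nodes))  # ['floor_1', 'fridge', 'kitchen', 'office']
```

Only the `visible` graph would be serialized into the prompt, so its size depends on how many nodes have been expanded rather than on the size of the whole environment.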

Three key innovations:

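A collapsed 3DSG lets the planner search for the task-relevant subgraph starting from a small number of root nodes, then search further by expanding subgraphs. This improves scalability, because an overly complex full scene graph would exceed the LLM's token limit.
The planning horizon grows with the complexity of the given task, and the LLM then tends to hallucinate or produce infeasible action sequences. A mature path planner such as Dijkstra is therefore used to connect the high-level nodes.
An iterative replanning pipeline that corrects any unexecutable actions
- For example, forgetting to open the fridge before putting something into it
  This avoids plan failures caused by the environment's physical constraints, conflicting predicates, hallucinations, or inconsistencies.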
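The path-planning step can be illustrated with networkx's built-in Dijkstra routine. This is a sketch under assumptions: the room graph, the edge costs, and the `goto(...)` action format are all made up for the example.

```python
import networkx as nx

# Hypothetical room-level connectivity; edge weights are made-up travel costs.
rooms = nx.Graph()
rooms.add_weighted_edges_from([
    ("office", "corridor", 2.0),
    ("corridor", "kitchen", 3.0),
    ("office", "lobby", 1.0),
    ("lobby", "kitchen", 6.0),
])


def navigation_steps(graph, start, goal):
    """Turn the shortest room-to-room path into a list of goto actions."""
    path = nx.dijkstra_path(graph, start, goal, weight="weight")
    return [f"goto({room})" for room in path[1:]]


steps = navigation_steps(rooms, "office", "kitchen")
print(steps)  # ['goto(corridor)', 'goto(kitchen)']
```

With navigation delegated this way, the LLM only has to name the rooms it wants to reach, which shortens the part of the plan it must generate itself.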

Insight

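Whether each scene node gets expanded, and whether that node is relevant to the task, is decided by the LLM, which matches my own thinking. In effect, the LLM acts as a checker that walks the hierarchy level by level to find the task's points of interest.
The scene graph simulator serves as the verifier of whether a plan is feasible.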
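The verifier role can be sketched as a replanning loop in which a simulator rejects a plan and its feedback goes back to the planner. Everything here is hypothetical: `simulate` checks a single made-up precondition, and `stub_planner` stands in for the LLM.

```python
def simulate(plan):
    """Return (ok, feedback); checks one made-up precondition on the fridge."""
    fridge_open = False
    for i, action in enumerate(plan):
        if action == "open(fridge)":
            fridge_open = True
        elif action == "close(fridge)":
            fridge_open = False
        elif action.startswith("put(") and action.endswith(", fridge)"):
            if not fridge_open:
                return False, f"step {i}: cannot {action}, fridge is closed"
    return True, "ok"


def stub_planner(task, feedback):
    """Stand-in for the LLM: returns a repaired plan once it sees feedback."""
    if feedback is None:
        return ["pick(apple)", "put(apple, fridge)"]
    return ["pick(apple)", "open(fridge)", "put(apple, fridge)", "close(fridge)"]


def plan_with_replanning(task, planner, max_iters=3):
    """Ask the planner for a plan until the simulator accepts it."""
    feedback = None
    for _ in range(max_iters):
        plan = planner(task, feedback)
        ok, feedback = simulate(plan)
        if ok:
            return plan
    raise RuntimeError(f"no executable plan found: {feedback}")


final_plan = plan_with_replanning("put the apple in the fridge", stub_planner)
print(final_plan)
```

The first plan is rejected because the fridge is still closed. The second one passes, so the loop returns it after two iterations.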

Posted 2025-03-18 | Updated 2026-06-22 | Review | a minute read (About 197 words)

Clio: Real-time Task-Driven Open-Set 3D Scene Graphs

Contributions:

The first contribution of this paper is to propose a task-driven 3D scene understanding problem, where the robot is given a list of tasks in natural language, and has to select the granularity and the subset of objects and scene structure to retain in its map that is sufficient to complete the tasks.
The second contribution is an algorithm for task-driven 3D scene understanding based on an Agglomerative Information Bottleneck (IB) approach, which clusters 3D primitives in the environment into task-relevant objects and regions.
Building on these two contributions, the authors implement a real-time pipeline.

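The paper argues that different tasks need semantic information at different granularities, and it achieves this by combining SAM with CLIP ([[CLIP多模态预训练模型]]). However, it ignores predicate relations and parent-child relations between objects. In essence, it still only supports basic operations such as navigating, picking up, and putting down.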
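To make the notion of task relevance concrete, here is a deliberately simplified sketch. It keeps only the primitives whose embedding is close to some task embedding by cosine similarity. This is not the Agglomerative IB algorithm from the paper. The threshold, the function names, and the toy 3-D vectors (standing in for CLIP embeddings) are all assumptions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def select_relevant(primitive_emb, task_emb, threshold=0.8):
    """Keep primitives whose best task similarity reaches the threshold."""
    sim = cosine_similarity(primitive_emb, task_emb)
    best_task = sim.argmax(axis=1)
    keep = sim.max(axis=1) >= threshold
    return keep, best_task


# Toy vectors standing in for CLIP embeddings (real ones are much larger).
primitives = np.array([[1.0, 0.1, 0.0],   # close to task 0
                       [0.0, 1.0, 0.1],   # close to task 1
                       [0.0, 0.0, 1.0]])  # unrelated to both tasks
tasks = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

keep, best_task = select_relevant(primitives, tasks)
print(keep.tolist())  # [True, True, False]
```

A hard threshold only filters primitives. Merging the surviving primitives into objects and regions at a task-dependent granularity is the part this sketch leaves out.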

Posted 2025-03-18 | Updated 2026-06-22 | Review | a few seconds read (About 3 words)

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Posted 2025-03-12 | Updated 2026-06-22 | Review | a few seconds read (About 94 words)

SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences

Overview of the proposed SceneGraphFusion framework. Our method takes a stream of RGB-D images a) as input to create an incremental geometric segmentation b). Then, the properties of each segment and a neighbor graph between segments are constructed. The properties d) and neighbor graph e) of the segments that have been updated in the current frame c) are used as the inputs to compute node and edge features f) and to predict a 3D scene graph g). Finally, the predictions are h) fused back into a globally consistent 3D graph.

Posted 2025-03-11 | Updated 2026-06-22 | Review | a few seconds read (About 3 words)

PHYSCENE: Physically Interactable 3D Scene Synthesis for Embodied AI

Posted 2025-03-11 | Updated 2026-06-22 | Review | a few seconds read (About 23 words)

Scene Reconstruction with Functional Objects for Robot Autonomy

The idea is similar to Fei-Fei Li's [[ACDC- Automated Creation of Digital Cousins for Robust Policy Learning]].

Posted 2025-03-11 | Updated 2026-06-22 | Review | a few seconds read (About 3 words)

Part-level Scene Reconstruction Affords Robot Interaction

Posted 2025-02-17 | Updated 2026-06-22 | Review | 2 minutes read (About 297 words)

ConceptFusion

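## Approach

The goal is to build an open-set multimodal 3D map `M`. Modality-specific encoders (foundation models) $F_{Mode}$ can encode multi-dimensional signals such as images, text, audio, and clicks into a shared vector space.

`M` consists of a set of points, each of which contains a vertex position, a normal vector, a confidence count, a color, and a concept vector.

The first step is frame preprocessing, where a frame is a single input image. Vertex-normal maps and the camera pose are computed from the sequence of input depth images, and then a semantic context embedding is computed for every pixel of every image. This embedding is obtained by combining local and global CLIP features.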
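The per-point record and the local/global combination can be sketched as follows. This is an illustration under assumptions: the `MapPoint` field names, the `fuse_features` helper, and the fixed blending weight are hypothetical, and the 4-D vectors only stand in for real CLIP features.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MapPoint:
    position: np.ndarray   # (3,) vertex position
    normal: np.ndarray     # (3,) normal vector
    confidence: float      # confidence count
    color: np.ndarray      # (3,) RGB color
    concept: np.ndarray    # (D,) concept vector


def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def fuse_features(local_feat, global_feat, w_global=0.5):
    """Blend a region-level and an image-level feature into one unit vector.

    The fixed weight is an assumption made for this sketch.
    """
    mixed = w_global * normalize(global_feat) + (1.0 - w_global) * normalize(local_feat)
    return normalize(mixed)


# Toy vectors standing in for CLIP features.
local_feat = np.array([1.0, 0.0, 0.0, 0.0])
global_feat = np.array([0.0, 1.0, 0.0, 0.0])

point = MapPoint(
    position=np.zeros(3),
    normal=np.array([0.0, 0.0, 1.0]),
    confidence=1.0,
    color=np.array([255, 255, 255]),
    concept=fuse_features(local_feat, global_feat),
)
print(point.concept)
```

Because the concept vector lives in the same space as the encoders' outputs, a text or image query could be compared against every point with a plain dot product.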