Chen Yulin's Blog

Posted 2025-03-19 | Updated 2025-07-24 | Review | a few seconds read (About 42 words)

ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

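Uses an LLM to infer the spatial relations between objects, and builds the scene graph from those relations.

It can still only reason about object-level spatial relations, so it cannot support part-level manipulation.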

Posted 2025-03-18 | Updated 2025-07-24 | Review | 2 minutes read (About 355 words)

SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning

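The main ideas are all in the paper's pseudocode: by expanding only part of the scene graph (which has a strictly hierarchical structure), it controls the size of the scene graph that is fed to the LLM.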

A scalable approach to ground LLM-based task planners across environments spanning multiple rooms and floors

The scene graph is represented with networkx (a Python package).
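A minimal sketch of what such a representation might look like, assuming a floor -> room -> object hierarchy stored in a `networkx.DiGraph`. The node names, the `level` attribute, and the `collapse`/`expand` helpers are hypothetical and only illustrate the idea of showing the LLM a partial graph; they are not SayPlan's actual API.

```python
import networkx as nx

def build_scene_graph():
    """Build a toy, strictly hierarchical scene graph (assumed layout)."""
    g = nx.DiGraph()
    g.add_node("floor_1", level="floor")
    for room in ("kitchen", "office"):
        g.add_node(room, level="room")
        g.add_edge("floor_1", room)
    for obj, room in (("fridge", "kitchen"), ("mug", "kitchen"), ("desk", "office")):
        g.add_node(obj, level="object")
        g.add_edge(room, obj)
    return g

def collapse(g, keep_levels=("floor", "room")):
    """Keep only the upper levels, so the graph stays small."""
    keep = [n for n, d in g.nodes(data=True) if d["level"] in keep_levels]
    return g.subgraph(keep).copy()

def expand(full, view, node):
    """Return a copy of `view` with the children of `node` added from `full`."""
    out = view.copy()
    for child in full.successors(node):
        out.add_node(child, **full.nodes[child])
        out.add_edge(node, child)
    return out

full = build_scene_graph()
collapsed = collapse(full)                      # floors and rooms only
expanded = expand(full, collapsed, "kitchen")   # kitchen objects become visible
```

Only `expanded` would be serialized into the prompt, so objects in rooms that were never expanded (here `desk`) cost no tokens.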

Three key innovations:

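A collapsed 3DSG is used to search for the task-relevant subgraph starting from a small number of root nodes (the search then continues by expanding subgraphs). This improves scalability, because an overly complex full scene graph would otherwise exceed the LLM's token limit.
The planning horizon grows with the complexity of the given task, and the LLM then tends to hallucinate or produce infeasible action sequences. A mature path planner such as Dijkstra is therefore used to connect the high-level nodes.
An iterative replanning pipeline corrects any unexecutable actions
- e.g., forgetting to open the fridge before putting something into it
  This avoids plan failures caused by hallucination, or by plans that are inconsistent with the environment's physical constraints and predicates.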
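The second and third points can be sketched together, under stated assumptions: the room graph, the edge weights, the action names, and the `first_error`/`replan` helpers below are all hypothetical. In particular, `replan` patches the plan with a hard-coded rule where the real pipeline would feed the error back to the LLM. Only `nx.dijkstra_path` is an actual networkx call.

```python
import networkx as nx

# Hypothetical room-connectivity graph; edge weights are travel costs.
rooms = nx.Graph()
rooms.add_weighted_edges_from([
    ("office", "corridor", 2.0),
    ("corridor", "kitchen", 3.0),
    ("office", "lobby", 1.0),
    ("lobby", "kitchen", 6.0),
])

def expand_goto(plan, start):
    """Replace each high-level ("goto", room) step with Dijkstra waypoints."""
    out, here = [], start
    for action, arg in plan:
        if action == "goto":
            path = nx.dijkstra_path(rooms, here, arg, weight="weight")
            out += [("goto", r) for r in path[1:]]
            here = arg
        else:
            out.append((action, arg))
    return out

def first_error(plan):
    """Toy feasibility check: the fridge must be open before placing into it."""
    fridge_open = False
    for i, (action, arg) in enumerate(plan):
        if (action, arg) == ("open", "fridge"):
            fridge_open = True
        elif (action, arg) == ("close", "fridge"):
            fridge_open = False
        elif (action, arg) == ("place_in", "fridge") and not fridge_open:
            return i
    return None

def replan(plan, max_iters=3):
    """Stand-in for LLM replanning: patch the failing step, then re-check."""
    for _ in range(max_iters):
        i = first_error(plan)
        if i is None:
            return plan
        plan = plan[:i] + [("open", "fridge")] + plan[i:]
    raise RuntimeError("no executable plan found")

draft = [("goto", "kitchen"), ("place_in", "fridge")]
final = replan(expand_goto(draft, start="office"))
```

The planner only has to name the destination room; the waypoint sequence comes from the classical path planner, which keeps the part of the plan the LLM must get right short.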

Insight

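Whether each scene node gets expanded, and whether that node is relevant to the task, is decided by the LLM, which matches my own idea. In effect the LLM acts as a checker that traverses the graph layer by layer to find the task's points of interest.
The Scene Graph Simulator serves as the verifier of whether a plan for the task is feasible.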
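The layer-by-layer traversal described above might be sketched as follows. This is an illustration only: `stub_llm_select` is a hypothetical hand-written lookup standing in for the real LLM call, and the graph contents and task string are made up.

```python
import networkx as nx

# Toy hierarchy: building -> floors -> rooms -> objects (assumed layout).
full = nx.DiGraph()
full.add_edges_from([
    ("building", "floor_1"), ("building", "floor_2"),
    ("floor_1", "kitchen"), ("floor_1", "office"),
    ("floor_2", "lab"),
    ("kitchen", "fridge"), ("kitchen", "mug"),
    ("office", "desk"),
])

def stub_llm_select(task, candidates):
    """Hypothetical stand-in for the LLM: a hand-written relevance table."""
    relevant = {"put the mug in the fridge": {"floor_1", "kitchen", "mug", "fridge"}}
    return [c for c in candidates if c in relevant.get(task, set())]

def semantic_search(graph, root, task, select=stub_llm_select):
    """Expand one node at a time; keep only children the checker marks relevant."""
    visible = {root}
    frontier = [root]
    while frontier:
        node = frontier.pop()
        children = list(graph.successors(node))
        keep = select(task, children)   # the checker decides what stays expanded
        visible.update(keep)            # irrelevant children are contracted away
        frontier.extend(keep)
    return visible

found = semantic_search(full, "building", "put the mug in the fridge")
```

The checker never sees the whole graph at once, only the children of the node currently being expanded, which is what bounds the prompt size.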

Posted 2025-03-18 | Updated 2025-07-24 | Review | a minute read (About 197 words)

Clio: Real-time Task-Driven Open-Set 3D Scene Graphs

Contributions:

The first contribution of this paper is to propose a task-driven 3D scene understanding problem, where the robot is given a list of tasks in natural language, and has to select the granularity and the subset of objects and scene structure to retain in its map that is sufficient to complete the tasks.
The second contribution is an algorithm for task-driven 3D scene understanding based on an Agglomerative IB approach, that is able to cluster 3D primitives in the environment into task-relevant objects and regions.
Building on the above, the paper implements a real-time pipeline.
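To make the Agglomerative IB idea concrete, here is a minimal generic sketch, not Clio's implementation. It assumes each primitive `x` has a weight `p(x)` and a distribution `p(y|x)` over tasks (the numbers below are made up), and greedily merges the pair of clusters whose merge loses the least information about the tasks, using the standard merge cost `(p_i + p_j) * JS(p(y|i), p(y|j))`. The sketch considers all pairs and stops at a fixed cluster count, both of which are simplifications.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(w_i, p_i, w_j, p_j):
    """Information about the tasks lost by merging clusters i and j."""
    w = w_i + w_j
    a, b = w_i / w, w_j / w
    mix = [a * x + b * y for x, y in zip(p_i, p_j)]
    return w * (a * kl(p_i, mix) + b * kl(p_j, mix))   # w * weighted JS divergence

def agglomerative_ib(weights, cond, n_clusters):
    """Greedy bottom-up merging until `n_clusters` clusters remain."""
    clusters = [([i], w, list(p)) for i, (w, p) in enumerate(zip(weights, cond))]
    while len(clusters) > n_clusters:
        pairs = [
            (merge_cost(clusters[a][1], clusters[a][2], clusters[b][1], clusters[b][2]), a, b)
            for a in range(len(clusters)) for b in range(a + 1, len(clusters))
        ]
        _, a, b = min(pairs)
        ids_a, w_a, p_a = clusters[a]
        ids_b, w_b, p_b = clusters[b]
        w = w_a + w_b
        merged = (ids_a + ids_b, w, [(w_a * x + w_b * y) / w for x, y in zip(p_a, p_b)])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return [sorted(ids) for ids, _, _ in clusters]

# Four primitives, two tasks: primitives 0-1 matter for task A, 2-3 for task B.
weights = [0.25, 0.25, 0.25, 0.25]
cond = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
groups = agglomerative_ib(weights, cond, n_clusters=2)
```

Primitives with similar task relevance end up in the same cluster, which is how the granularity of the map follows from the task list rather than from a fixed object vocabulary.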

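The paper argues that different tasks need semantic information at different granularities, and achieves this by combining SAM with [[CLIP多模态预训练模型]]. However, it ignores predicate relations and parent-child relations between objects. In essence it can still only perform basic operations: navigating, picking up, and putting down.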

Posted 2025-03-18 | Updated 2025-07-24 | Review | a few seconds read (About 3 words)

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Posted 2025-03-18 | Updated 2025-07-24 | Review | a few seconds read (About 31 words)

Representation Learning for Scene Graph Completion via Jointly Structural and Visual Embedding

The architecture of RLSV is a three-layered hierarchical projection that projects a visual triple onto the attribute space, the relation space, and the visual space in order.

Posted 2025-03-18 | Updated 2025-07-24 | Review | a few seconds read (About 7 words)

Visual Translation Embedding Network for Visual Relation Detection

VTransE

Posted 2025-03-16 | Updated 2025-07-24 | Review | a minute read (About 112 words)

Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation

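The **extensibility** and **inference speed** of an SGG framework are crucial for accelerating downstream tasks. This paper studies efficiency and scalability in SGG.

## Insights

The biggest highlight is that object pairs with similar interaction regions are clustered into subgraphs that share one phrase representation (called the subgraph feature), and refinement is then done on the subgraphs.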
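A rough sketch of the grouping step, under assumptions: here "similar interaction region" is approximated by the IoU between the union boxes of each subject-object pair, with a greedy first-match assignment and an arbitrary threshold of 0.5. The helper names, boxes, and threshold are illustrative and not taken from the paper's code.

```python
def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) that covers both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def iou(a, b):
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    return inter / (area(a) + area(b) - inter)

def cluster_pairs(boxes, pairs, thresh=0.5):
    """Greedily assign each (subject, object) pair to a subgraph region."""
    subgraphs = []  # each entry: {"region": box, "pairs": [(subject, object), ...]}
    for s, o in pairs:
        region = union_box(boxes[s], boxes[o])
        for sg in subgraphs:
            if iou(sg["region"], region) >= thresh:
                sg["pairs"].append((s, o))   # this pair shares the subgraph feature
                break
        else:
            subgraphs.append({"region": region, "pairs": [(s, o)]})
    return subgraphs

boxes = {
    "person": (0, 0, 4, 8), "hat": (1, 0, 3, 2), "shirt": (0, 2, 4, 6),
    "dog": (10, 10, 14, 14), "ball": (13, 12, 15, 14),
}
pairs = [("person", "hat"), ("person", "shirt"), ("hat", "shirt"), ("dog", "ball")]
subgraphs = cluster_pairs(boxes, pairs)
```

Four candidate relations collapse into two regions here, so a phrase feature would be computed twice instead of four times; that sharing is where the efficiency gain comes from.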