Chen Yulin's Blog

Posted 2025-08-22 | Updated 2025-11-28 | Note | 3 minutes read (About 404 words)

process:

trained:
- swint_mcm
- swint_cb_mcm
- swint_cb_scm
- resdcn_scm
- resdcn_cb_scm
- resdcn_cb_mcm
- resdcn_mcm
- swint_scm
training:
evaluated:
- swint_mcm
- swint_cb_scm
- swint_cb_mcm
- resdcn_cb_scm
- resdcn_cb_mcm
- resdcn_scm
- resdcn_mcm
- swint_scm
evaluating:
evaluated on 1000 samples

KAF-Net: Part-level Kinematic Relation Graph Generation For Robot Manipulation

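| Backbone | In. Aug | Class balance | $mAP_{50}$ | $R@10/20/40$ | $mR@10/20/40$ |
| --- | --- | --- | --- | --- | --- |
| SwinT | MCM | yes | 24.2 | 33.4/53.8/73.9 | 25.1/52.9/69.2 |
| SwinT | MCM | no | 23.4 | 43.5/62.3/78.7 | 26.9/55.3/67.9 |
| SwinT | Single | yes | 23.5 | 32.3/51.6/69.2 | 34.4/52.9/65.4 |
| SwinT | Single | no | | | |
| ResDCN | MCM | yes | 23.1 | 32.8/52.8/70.1 | 24.8/52.8/67.7 |
| ResDCN | MCM | no | train~ | | |
| ResDCN | Single | yes | 20.5 | 39.9/54.3/69.3 | 37.7/51.2/65.1 |
| ResDCN | Single | no | 22.3 | 33.8/52.4/69.4 | 35.7/55.6/66.7 |

On `swint_cb_mcm`: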

| Image-Mask Branch | $mAP_{50}$ | $mR@10/20/40$ |
| --- | --- | --- |
| Yes | | |
| No | | |

| VLM | VI | VR | unVR |
| --- | --- | --- | --- |
| Gemini 2.5 Flash | | | |
| Pixtral 12B | | | |

Task Planning:
example:

https://sankeymatic.com/build/?i=PTAEGUEMFsAcBsCmoBSB7ARhRkBOBjAC1ABMBLSAc1xgC4AoegFTQBdJ5RXcL4BnUAG0ALAF1QABTysAtH0SUAOgDtQAM0hl4AV1zIAxIgCMAVgDsJgJzM2HLjw4DBJgGzipuWfKWq%2B2%2DPiIfAKMHl4KKuqaOnpCRuJMkHwA1qB6SABukMqsoGTKfOzKgZHKbKDe0Ig5iCT0YXIRqhpausiCAMwJSanpiFk5oLDSpeWV1ay1odKNPhX%2BgcFxbpIzAJLKami4VSQQAOL7oJEtMQbG5lb1M96RfgFBTsIADO7rm9u7B0d3%2DoiPoGmnhkGy2O1q32OzWibSEAA5xAB1XBoZSUUDJfKIaCQVhkfBpRDwXFkVGkRCTfB41HXYGgz4Q8CHKFRVqxTpIlFoobSdTaYrU1QkCmIKmk5S02T08F7Jk%2DaFs9pdUAbQrZQIyTJEyGnNqSkEfGWQk4w9kAJk5qPRADUADIAWTyBSKgXJfHwPFggsY%2BulXzlLPu%2DyWzm6KSGxOUyny6N1sUMpgs1gafsZzN%2BgQBnRMYdSCGy0e5QeCjES4fzUZjrLOQmVyKtXB6Mj6A1yfgw1EgsEIzB6EYLVbj7XioHr3MgYrJYK4hGQw08oGk%2BO0xO9ZbzkcLsdNw%2DEADFPrjQIhcCjcPQ%2BGQAF7IADuoCMRmez3ooGILhML5xuEo%2BVAnEfV9cAfM1X1yIw4VfLAzRfDtQHwNB4G2UB9DUND0LAmgCnnCZQAAOXoMphVAe8zRMV9iA6FxXz4YZ8CrYRrFADBtmFYCXxnbFkAwV8EKQ4D9DhIThNfNA6LIVgAE8H3oNQkPvfBdCyVhYWeAA6YRqKdWceFYNQUWgRsfwpXjEOQ%2DRLEsqzRPEqTQHU4RyOJSS0G0XJWJPRc3LQHE8XwV8ACttEKMg1BcnhfwKfDAuCvEwuqEgBAIwkMhPbwaG7aLF1YdgiHyBC4CQSZCjQUBlBwPRCnoYkMCJAQ%2BPM58mo4wgyGI5LWsoQh4DILrcnUsjXy2HINFdeQeDUV8evK2iJyrDj0hJVLLxvB8INfHFKGjMKH2farIFq%2BBlBgZAu1gCqBAATRo69kCMMw1PIkjEF6whcheF8aqJLIdFO2BzrwK6hpXeBYD0ejLzJa6hjQS9BUXFjUtfW8Xr60APv2w7YFhiTxS81g0A4XrVCMGiiC4oYTwAfTVShECGshcEKRc1Emc9mL0SBkmx%2DJclAn7tGQMFfNAAByAAaNTRdfMHEDUMgAA8xel%2BY0MV5X6FYWcqjQND5FySBQC0rBIPg%2BzXz2L8KUNqocnFPhsmSRBJN8%2DFQGhnqQugDAOHVIJ3aAA

https://sankeymatic.com/build/?i=CoewLghgNgBGBOBLaBnGBtAnAXRgBQnjAFoUBTAcxgDMJEoBXeMmAYjIEYBWAdi8wBQoSLATIoadBw64CRUpRgoGAYxVkUKAQLklyVWvSYt0ADlzAIKANYxmUMgDcIAOzAxELlJBfqYL8CVKAFsyNzIAEx1CPUVDRmYMGRhLGzsyB2c3GAAHGP9A%2DVDwqOj5fSVVdU0k2RjiAEkXahB4UIiYAGUAcW6aOgSWdm4%2BQV0FKmU1DUkOAAY6%2BSaWtsiu3srpzW1x5db29b744wwFmAB1eBAXKmtPMmCIMEQVdKgnxGuYCLIwMhVntcyiQ9qsOj0jgMTugzpdrlQ8kQaAxfICXN9fv80cDGs19msIf0jIkYbgmt5XOpiJkMociYMcaCDoTjiTknCbjAAGoAGQAsh4vD4%2DD8UCokDlsYy8WC6VNqpIuBYrLYcu8XC5PAYoYlhrx%2BNKVsyNvKZhglSkVbl1ZrOabtkIrWrXLbtcSTAAmXAcqiQGzEexOVzuZQAIwo8AgOQAFo60s6NVr6dDYVdORAAZ90Ss4NGWIj3DEXgx3tjUqqbUnWSYAMy4ABi%2ByeMDI8Cu8AEKEQAC8WAB3GDSOZzAQwaMwABsXBHj3gFE8MFg0lH8EHHtH7g4plHoZgHpH4ZgKhAUFabGoF8vG8jXkRYXcADkBAEfjABx6uKPxzWJ6OUHkVCTAAWQQYFDVoflXEdcweFhQ1HY9T1XVhTFQtDRxAADEDAABPQcBGoU8BxUJhnDAE45gAOiA39BTzJAwGoK5gjgQgKF%2BBCTzPVhMF4viMKw3CYCooDP3eHCQAYdwINbGAICkkBHmeFRRwAKwYbxEGoCSkHnLwYCfGB1M07SwgiNBDOYRxW30SMYwM0cnkgFRo08Y9gjVTFvBAfwyEIDQwAEd5QwyNBEO44dIug1zX0M1yKGjKBEAS9wqI%2DUcWjcWg%2DHIJBqFHJKXA0LDOWg%2BwPmsrte0HLdR0eChNW0wdhyCiAQqgFwIFCOSchyPz4DQABNP8exYDgeEoz83zIZLo3cICWuCjJnEYFgoz6wghoyksoByZhAK7L5htyEAuzROTwOs0c%2BxmlKYAWkclt207sKzOSFOgZL0Q4P8XNg3JWwAfQpdiMsQAbC2oP4OzA5gIGsHIQE8dx1xWhgWCNZsAHIABpKKx0c9rIahEAADxgLGCcqC8yYpgmwDzUIQAvchC0nHcYG3I9hNHDoZ1%2BCAYGKNEUFcawyBwpSXhgY6ks04JQ2gSkNBloA

Posted 2025-08-22 | Updated 2025-11-28 | Note | a few seconds read (About 98 words)

PSGBOT Analysis Table

process:

trained:
- swint_mcm
- swint_cb_mcm
- swint_cb_scm
- resdcn_scm
- resdcn_cb_scm
- resdcn_cb_mcm
training:
- swint_scm
evaluated:
- swint_mcm
- swint_cb_scm
- swint_cb_mcm
- resdcn_cb_scm
- resdcn_cb_mcm
- resdcn_scm
evaluating:
- resdcn_mcm
  evaluated on 1000 samples

KAF-Net: Part-level Kinematic Relation Graph Generation For Robot Manipulation

| Backbone | In. Aug | Class balance | $mAP_{50}$ | $R@10/20/40$ | $mR@10/20/40$ |
| --- | --- | --- | --- | --- | --- |
| SwinT | MCM | yes | 24.2 | 33.4/53.8/73.9 | 25.1/52.9/69.2 |
| SwinT | MCM | no | 23.4 | 43.5/62.3/78.7 | 26.9/55.3/67.9 |
| SwinT | Single | yes | 23.5 | 32.3/51.6/69.2 | 34.4/52.9/65.4 |
| SwinT | Single | no | 24.2 | 32.7/51.4/68.6 | 25.2/53.5/66.3 |
| ResDCN | MCM | yes | 23.1 | 32.8/52.8/70.1 | 24.8/52.8/67.7 |
| ResDCN | MCM | no | 23.4 | 36.9/52.5/69.9 | 23.9/47.8/62.4 |
| ResDCN | Single | yes | 20.5 | 39.9/54.3/69.3 | 37.7/51.2/65.1 |
| ResDCN | Single | no | 22.3 | 33.8/52.4/69.4 | 35.7/55.6/66.7 |

On Swin Transformer with class balance:

| Image-Mask Branch | $mAP_{50}$ | $mR@10/20/40$ |
| --- | --- | --- |
| Yes | | |
| No | | |

| VLM | VI | VR | unVR |
| --- | --- | --- | --- |
| Gemini 2.5 Flash | | | |
| Pixtral 12B | | | |

Posted 2025-04-16 | Updated 2025-11-28 | Review | a few seconds read (About 3 words)

RoboEXP

Posted 2025-04-14 | Updated 2025-11-28 | Note | a few seconds read (About 95 words)

(Mindmap) Part-level Scene Understanding for Robots

Concept Overview

Scene Graph

A scene graph is a structural representation that captures detailed semantics by explicitly modeling:

- objects ("man", "fire hydrant", "shorts")
- attributes of objects ("fire hydrant is yellow")
- relations between paired objects ("man jumping over fire hydrant")

A scene graph is a set of visual relationship triplets of the form <subject, relation, object> or <object, is, attribute>.

Scene graphs should serve as an **objective semantic representation** of the state of the scene.
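
As a minimal sketch of the triplet view described here, the scene graph can be held as a set of (subject, relation, object) records. The `Triplet` class, the helper names, and the "wearing" relation are illustrative assumptions, not taken from any particular scene graph library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One <subject, relation, object> or <object, is, attribute> record."""
    subject: str
    relation: str
    obj: str


# Toy graph built from the examples in the text; "wearing" is an assumed relation.
scene_graph = {
    Triplet("man", "jumping over", "fire hydrant"),
    Triplet("man", "wearing", "shorts"),
    Triplet("fire hydrant", "is", "yellow"),
}


def objects(graph):
    """Names that appear as entities (attribute values of 'is' triplets are skipped)."""
    names = set()
    for t in graph:
        names.add(t.subject)
        if t.relation != "is":
            names.add(t.obj)
    return sorted(names)


def attributes_of(graph, name):
    """Attribute values attached to `name` through <object, is, attribute> triplets."""
    return sorted(t.obj for t in graph if t.subject == name and t.relation == "is")


def relations_of(graph, name):
    """(relation, object) pairs where `name` is the subject of a non-attribute triplet."""
    return sorted((t.relation, t.obj) for t in graph
                  if t.subject == name and t.relation != "is")
```

Keeping attributes as ordinary triplets with the reserved relation `is` means one container holds both kinds of facts, at the cost of needing to filter on the relation name.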

Posted 2025-03-25 | Updated 2025-11-28 | Note | 5 minutes read (About 724 words)

(Roadmap) Deeper Scene Graph For Robots

Target Problem (Task Scenario)

Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment.
For example, in household settings:

OVMM Challenge: https://aihabitat.org/challenge/2023_homerobot_ovmm/
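Executing general, long-horizon, embodied tasks in such complex scenes requires generating a sequence of discrete actions, each of which can accumulate and propagate errors. Producing a feasible plan, and recovering when that plan fails, therefore requires an effective abstraction of the physical environment and a planner that can fully exploit that abstraction. Meeting these challenges requires integrating natural language understanding, multi-granularity scene abstraction and understanding, and resilient reasoning.

Coarse-grained (object-level) scene abstraction (scene graph construction) has already been studied extensively; see Reconstruct-Anything Literature Review. Those works focus on object detection and object-level visual relationship detection.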

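The part to focus on is multi-granularity scene abstraction.
Why multiple granularities are needed:

- Scalability: with only a single granularity, the number of scene-graph tokens fed to the LLM cannot be controlled, which limits scalability.
- Interacting with an object in more complex ways than grasping requires knowing each part's position, its semantic properties, and its parent-child relationship with the parent object. Scene graph generation therefore has to consider finer granularity.
- Objects of different complexity require different levels of granularity.
- Different tasks also require different object granularity.

Concrete examples (granularity levels required by the task):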
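<Task> Fill the kettle with water:
- <object-level> Kettle
  - <part-level> Lid
  - <part-level> Handle
- <object-level> Water dispenser
  - <part-level> Control panel
    - <part-level> Green button (room-temperature water)
    - <part-level> Red button (boiling water)
    - <part-level> Child lock
  - <part-level> Water tray
- <object-level> Table
  - <part-level> Tabletop
<Task> Leave the room:
- <object-level> Door
  - <part-level> Handle
  - <part-level> Note: "Put the toy back in the red basket before leaving the room"
- <object-level> Yellow duck toy
- <object-level> Red basket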
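
To make the scalability argument concrete, here is a minimal sketch of a part hierarchy that is serialized only down to a chosen depth, so the text handed to an LLM grows with the granularity a task actually needs. The `Node` class and `serialize` function are hypothetical names introduced for illustration; the water dispenser mirrors the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    level: str  # "object-level" or "part-level"
    children: List["Node"] = field(default_factory=list)


def serialize(node, max_depth, depth=0):
    """Render the subtree as indented lines, stopping below `max_depth`."""
    lines = ["  " * depth + f"- <{node.level}> {node.name}"]
    if depth < max_depth:
        for child in node.children:
            lines.extend(serialize(child, max_depth, depth + 1))
    return lines


dispenser = Node("water dispenser", "object-level", [
    Node("control panel", "part-level", [
        Node("green button (room-temperature water)", "part-level"),
        Node("red button (boiling water)", "part-level"),
        Node("child lock", "part-level"),
    ]),
    Node("water tray", "part-level"),
])

# Depth 0 keeps only the object; depth 2 exposes the individual buttons.
coarse = serialize(dispenser, max_depth=0)
fine = serialize(dispenser, max_depth=2)
```

A navigation task could pass `coarse` to the planner, while "fill the kettle" would need `fine`; the same graph serves both without changing its structure.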

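In finer-grained (part-level) scene abstraction, the focus is on recognizing the relationship between a child object and its parent object.

In addition, the part-level counterpart of object detection in an object-level scene graph is multi-granularity segmentation of child objects and extraction of their semantic information. This can be implemented with the existing Semantic-SAM plus a semantic feature extractor such as CLIP or another multimodal model.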

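Main Research Workflow

Define the research object: Parent-child Relationship

What aspects does parent-child relationship include?

- Semantic composition: how the presence or absence of the child object changes the semantics of the parent object. Translation in embedding space.
- Kinematic relations: the object needs to be constructed as a kinematic tree.

Project workflow

Self-supervised feature extraction method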
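
As a rough sketch of what representing an object as a kinematic tree could look like, each part below stores its parent and the joint type that connects it to that parent. The cabinet example, the `Part` class, and the joint labels ("fixed", "revolute", "prismatic", borrowed from common robotics usage) are assumptions for illustration, not the representation used in this project.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Part:
    name: str
    parent: Optional[str]  # None marks the root of the kinematic tree
    joint: str             # joint to the parent: "fixed", "revolute", or "prismatic"


# Hypothetical cabinet: a hinged door with a handle, plus a sliding drawer.
PARTS = [
    Part("cabinet", None, "fixed"),
    Part("door", "cabinet", "revolute"),
    Part("handle", "door", "fixed"),
    Part("drawer", "cabinet", "prismatic"),
]


def chain_to_root(parts: List[Part], name: str) -> List[str]:
    """Follow parent links from `name` up to the root part."""
    by_name = {p.name: p for p in parts}
    chain = []
    current = by_name[name]
    while current is not None:
        chain.append(current.name)
        current = by_name[current.parent] if current.parent else None
    return chain


def movable_parts(parts: List[Part]) -> List[str]:
    """Parts that can move relative to their parent."""
    return [p.name for p in parts if p.joint != "fixed"]
```

Walking `chain_to_root(PARTS, "handle")` shows that pulling the handle acts through the door's revolute joint, which is the kind of information an object-level graph cannot express.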

Posted 2025-03-19 | Updated 2025-11-28 | Review | a few seconds read (About 42 words)

ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

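Uses an LLM to infer spatial relations between objects and builds the scene graph from them.

It can still only infer object-level spatial relations and cannot support part-level manipulation.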

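Posted 2025-03-18 | Updated 2025-11-28 | Review | 2 minutes read (About 355 words)

SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning

The main idea is all in the pseudocode above: by expanding only part of the scene graph (which has a strict hierarchical structure), the size of the scene graph fed to the LLM is kept under control.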

A scalable approach to ground LLM-based task planners across environments spanning multiple rooms and floors

The scene graph is represented with networkx (a Python package).

Three key innovations:

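- A collapsed 3DSG is used to search for the task-relevant subgraph starting from a small number of root nodes (the search then continues by expanding subgraphs). This improves scalability by keeping an overly complex full scene graph from exceeding the LLM's token limit.
- The horizon of a task plan grows with the complexity of the given task, and the LLM tends to hallucinate or produce infeasible action sequences. A mature path planner such as Dijkstra is therefore used to connect high-level nodes.
- An iterative replanning pipeline corrects any unexecutable actions.
  - Example: forgetting to open the fridge before putting something into it.
  This avoids planning failures caused by inconsistencies, hallucinations, or violations of the physical constraints and predicates imposed by the environment.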
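
The first two ideas can be sketched with the standard library alone. The block below is an illustrative assumption, not SayPlan's code: `HIERARCHY`, `EDGES`, and the function names are made up, the graph is a plain dict rather than a networkx object, and the choice of which nodes to expand (made by the LLM in the paper) is passed in by hand.

```python
import heapq

# Hypothetical strict hierarchy: floor -> rooms -> objects.
HIERARCHY = {
    "floor": ["kitchen", "office"],
    "kitchen": ["fridge", "kettle"],
    "office": ["desk"],
}


def visible_nodes(root, expanded):
    """Nodes shown to the planner when only nodes in `expanded` reveal their children."""
    out = [root]
    if root in expanded:
        for child in HIERARCHY.get(root, []):
            out.extend(visible_nodes(child, expanded))
    return out


# Hypothetical weighted connectivity between high-level nodes.
EDGES = {
    "office": {"corridor": 2.0},
    "corridor": {"office": 2.0, "kitchen": 3.0},
    "kitchen": {"corridor": 3.0},
}


def dijkstra(edges, start, goal):
    """Classic Dijkstra search; returns (cost, path), or (inf, []) if unreachable."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in edges.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return float("inf"), []


collapsed = visible_nodes("floor", expanded=set())
partly_expanded = visible_nodes("floor", expanded={"floor", "kitchen"})
cost, route = dijkstra(EDGES, "office", "kitchen")
```

In this sketch the office stays collapsed while the kitchen is expanded, so the prompt carries five nodes instead of six; on a multi-floor building the saving is what keeps the graph inside the token limit. Navigation between rooms is delegated to `dijkstra`, so the LLM never has to produce the route itself.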

Insight

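- Whether each scene node is expanded, and whether that node matters for the task, is decided by the LLM, which matches my own idea. In effect, the LLM acts as a checker that traverses the graph layer by layer to find the task's points of interest.
- The Scene Graph Simulator serves as the verifier of whether a task plan is feasible.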
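
The verifier role can be illustrated with a toy replanning loop. Everything here is a hedged stand-in: `simulate` encodes a single hand-written rule (the fridge must be open before `put_in`), and `toy_planner` replaces the LLM with a function that repairs its plan once the feedback mentions a closed fridge. None of these names come from SayPlan.

```python
def simulate(plan):
    """Toy verifier: return (ok, feedback) for a list of (action, target) steps."""
    fridge_open = False
    for step, (action, target) in enumerate(plan):
        if action == "open" and target == "fridge":
            fridge_open = True
        elif action == "close" and target == "fridge":
            fridge_open = False
        elif action == "put_in" and target == "fridge" and not fridge_open:
            return False, f"step {step}: cannot put_in fridge while it is closed"
    return True, "ok"


def toy_planner(feedback):
    """Stand-in for the LLM: a naive plan first, a repaired plan after feedback."""
    if feedback and "closed" in feedback:
        return [("pick", "milk"), ("open", "fridge"),
                ("put_in", "fridge"), ("close", "fridge")]
    return [("pick", "milk"), ("put_in", "fridge")]


def plan_with_replanning(planner, max_iters=3):
    """Ask for a plan, verify it in the simulator, and feed errors back until it passes."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        plan = planner(feedback)
        ok, feedback = simulate(plan)
        if ok:
            return plan, attempt
    return None, max_iters


final_plan, attempts = plan_with_replanning(toy_planner)
```

The first attempt fails in the simulator rather than on the robot, and the error message becomes part of the next planning request, which is the point of placing a verifier between the planner and execution.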

Posted 2025-03-18 | Updated 2025-11-28 | Review | a minute read (About 197 words)

Clio: Real-time Task-Driven Open-Set 3D Scene Graphs

Contributions:

- The first contribution of this paper is to propose a task-driven 3D scene understanding problem, where the robot is given a list of tasks in natural language, and has to select the granularity and the subset of objects and scene structure to retain in its map that is sufficient to complete the tasks.
- The second contribution is an algorithm for task-driven 3D scene understanding based on an Agglomerative IB approach, that is able to cluster 3D primitives in the environment into task-relevant objects and regions.
- Building on the above, a real-time pipeline is implemented.

The paper proposes that different tasks need semantic information at different granularities, and implements this by combining SAM with [[CLIP多模态预训练模型|the CLIP multimodal pre-trained model]]. However, it ignores predicate relations and parent-child relations between objects. In essence it can still only perform basic operations: navigating, picking up, and putting down.

Posted 2025-03-18 | Updated 2025-11-28 | Review | a few seconds read (About 3 words)

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Posted 2025-03-18 | Updated 2025-11-28 | Review | a few seconds read (About 31 words)

Representation Learning for Scene Graph Completion via Jointly Structural and Visual Embedding

The architecture of RLSV is a three-layered hierarchical projection that projects a visual triple onto the attribute space, the relation space, and the visual space in order.

training:

evaluating: