Chen Yulin's Blog

Posted 2025-03-13 | Updated 2026-06-22 | Review | a few seconds read (About 0 words)

Posted 2025-03-12 | Updated 2026-06-22 | Review | a few seconds read (About 94 words)

SceneGraphFusion- Incremental 3D Scene Graph Prediction from RGB-D Sequences

Overview of the proposed SceneGraphFusion framework. Our method takes a stream of RGB-D images a) as input to create an incremental geometric segmentation b). Then, the properties of each segment and a neighbor graph between segments are constructed. The properties d) and neighbor graph e) of the segments that have been updated in the current frame c) are used as the inputs to compute node and edge features f) and to predict a 3D scene graph g). Finally, the predictions are h) fused back into a globally consistent 3D graph.
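The final fusion step (h) can be illustrated with a small sketch. It assumes, purely for illustration, that each frame yields a class-probability vector for every updated segment and that fusion is a weighted running average. The names `GlobalGraph` and `fuse` and the unit weight are hypothetical and not taken from the paper.

```python
from collections import defaultdict


class GlobalGraph:
    """Toy global scene graph that keeps fused class probabilities per node."""

    def __init__(self):
        self.probs = {}                    # node id -> fused probability vector
        self.weights = defaultdict(float)  # node id -> accumulated weight

    def fuse(self, node_id, frame_probs, weight=1.0):
        """Fold one per-frame prediction into the running average (assumed rule)."""
        if node_id not in self.probs:
            self.probs[node_id] = list(frame_probs)
        else:
            w_old = self.weights[node_id]
            self.probs[node_id] = [
                (w_old * p_old + weight * p_new) / (w_old + weight)
                for p_old, p_new in zip(self.probs[node_id], frame_probs)
            ]
        self.weights[node_id] += weight
        return self.probs[node_id]


graph = GlobalGraph()
graph.fuse("segment_7", [0.8, 0.2])          # first observation of the segment
fused = graph.fuse("segment_7", [0.4, 0.6])  # same segment updated in a later frame
```

Only segments touched in the current frame need to call `fuse`, which mirrors the incremental nature of the pipeline described in the caption.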

Posted 2025-03-06 | Updated 2026-06-22 | Review | a minute read (About 206 words)

Semantic-SAM

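This paper could serve as one of the cornerstones of physical scene reconstruction.
Similar follow-up work includes OMG-Seg.

What it is

A universal image segmentation model that segments and recognizes anything at any desired granularity.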

Semantic-awareness
Granularity-abundance

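Dataset

### Challenges
- Some current general-purpose object and segmentation datasets do provide large volumes of data and rich semantic information, but only at the object level.
- Datasets that annotate fine-grained parts are limited in size.
- The dataset used by SAM is large and multi-granular, but it contains no semantic annotations.

### Solution
Merge multiple datasets with different segmentation granularities.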

Posted 2025-03-06 | Updated 2026-06-22 | Review | a few seconds read (About 26 words)

MaskDINO

Note: this DINO is not the self-distillation, self-supervised [[DINO]]; it is derived from [[DETR]].

Posted 2025-03-03 | Updated 2026-06-22 | Review | a few seconds read (About 0 words)

ZegCLIP

Posted 2025-03-03 | Updated 2026-06-22 | Review | a few seconds read (About 108 words)

BLIP

A vision-language model that unifies vision-language understanding and generation tasks.

The work has two main parts:

Removing noise from the datasets used for image-text retrieval
Vision-language understanding and generation

Model

Noise Filtering

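The captioner (Cap) generates synthetic captions for web images. The filter (Filt) then checks both the original web text, which may be noisy, and the synthetic caption against the image, and discards any text that does not match it.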
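A minimal sketch of this caption-and-filter loop, assuming the filter simply discards any text, web or synthetic, that it judges not to match the image. The function `cap_filt` and the two toy callables are hypothetical stand-ins, not BLIP's actual models or API.

```python
def cap_filt(pairs, captioner, matches):
    """Bootstrap a cleaner dataset from noisy (image, web_text) pairs.

    captioner(image) -> synthetic caption   (hypothetical callable)
    matches(image, text) -> bool            (hypothetical filter callable)
    """
    cleaned = []
    for image, web_text in pairs:
        synthetic = captioner(image)
        # Keep every text, web or synthetic, that the filter accepts.
        for text in (web_text, synthetic):
            if matches(image, text):
                cleaned.append((image, text))
    return cleaned


# Toy stand-ins: an "image" is just the set of words it depicts.
def toy_captioner(image):
    return "a photo of " + " and ".join(sorted(image))


def toy_filter(image, text):
    return any(word in text.split() for word in image)


noisy_pairs = [
    (frozenset({"dog"}), "click here for cheap flights"),  # noisy web text
    (frozenset({"cat"}), "my cat on the sofa"),             # clean web text
]
cleaned = cap_filt(noisy_pairs, toy_captioner, toy_filter)
```

In this toy run the unrelated web text is dropped, while the matching web text and both synthetic captions are kept.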

Understanding & Generation

Posted 2025-02-18 | Updated 2026-06-22 | Review | a few seconds read (About 0 words)

Extract Free Dense Labels from CLIP

Posted 2025-02-17 | Updated 2026-06-22 | Review | 2 minutes read (About 297 words)

ConceptFusion

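## Approach

The goal is to build an open-set multimodal 3D map `M`. Modality-specific encoders (foundation models) $F_{Mode}$ can encode multimodal signals such as images, text, audio, and clicks into a vector space.

`M` consists of a set of points, each of which holds a vertex position, a normal vector, a confidence count, a color, and a concept vector.

The first step is per-frame preprocessing (a frame is a single input image): vertex and normal maps and camera poses are obtained from the sequence of input depth images, and a semantic context embedding is then computed for every pixel of every image. This embedding is obtained by combining local and global CLIP features.

Next comes feature fusion: the camera pose is used to map each frame's vertex and normal maps into the global coordinate frame. Every pixel $(u, v)_t$ in frame $X_t$ has a corresponding point $P_k$ in `M`.

The formula for aggregating features from different frames $X_t$ into the feature points of `M`: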
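A minimal sketch of such an aggregation, assuming it takes the form of a confidence-weighted running average of each point's concept vector. That form, the weight `alpha`, and the names `MapPoint` and `fuse_feature` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MapPoint:
    """One point P_k of the map M (only the fields needed for fusion)."""
    concept: list = field(default_factory=list)  # concept vector
    confidence: float = 0.0                      # confidence count


def fuse_feature(point, pixel_feature, alpha=1.0):
    """Blend a pixel feature into the point, weighted by confidence (assumed rule)."""
    if not point.concept:
        point.concept = list(pixel_feature)
    else:
        c = point.confidence
        point.concept = [
            (c * f_old + alpha * f_new) / (c + alpha)
            for f_old, f_new in zip(point.concept, pixel_feature)
        ]
    point.confidence += alpha
    return point


p = MapPoint()
fuse_feature(p, [1.0, 0.0])  # feature of pixel (u, v) in frame t
fuse_feature(p, [0.0, 1.0])  # feature of the matching pixel in a later frame
```

With equal weights the point ends up holding the mean of the two pixel features, and its confidence count grows with every observation.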

Posted 2025-02-16 | Updated 2026-06-22 | Review | a few seconds read (About 17 words)

Grounded-SAM

https://github.com/IDEA-Research/Grounded-Segment-Anything

By [[Grounding-DINO]] + SAM
Achieving Open-Vocab. Det & Seg
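The composition can be sketched as a two-stage pipeline: a text-conditioned detector proposes boxes, and a promptable segmenter turns each box into a mask. Both callables below are toy stand-ins written for this note, not the real Grounding-DINO or SAM APIs.

```python
def open_vocab_segment(image, prompt, detect, segment):
    """Detect boxes for a text prompt, then segment each box.

    detect(image, prompt) -> list of (label, box)   (hypothetical callable)
    segment(image, box)   -> mask                   (hypothetical callable)
    """
    results = []
    for label, box in detect(image, prompt):
        results.append({"label": label, "box": box, "mask": segment(image, box)})
    return results


# Toy stand-ins operating on a dict "image" of named boxes (x0, y0, x1, y1).
def toy_detect(image, prompt):
    wanted = [w.strip() for w in prompt.split(".") if w.strip()]
    return [(name, box) for name, box in image.items() if name in wanted]


def toy_segment(image, box):
    x0, y0, x1, y1 = box
    # The "mask" is simply every integer pixel inside the box.
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


image = {"mug": (0, 0, 2, 2), "spoon": (3, 0, 4, 3)}
out = open_vocab_segment(image, "mug . fork", toy_detect, toy_segment)
```

Because the detector is driven by free text, the label set is not fixed in advance, which is where the open-vocabulary behaviour comes from.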

Posted 2025-01-06 | Updated 2026-06-22 | Note | 5 minutes read (About 790 words)

LERF- Language Embedded Radiance Fields

NeRF+CLIP

Intro

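Background

Neural Radiance Fields (NeRFs) have emerged as a powerful technique for capturing photorealistic digital representations of intricate real-world 3D scenes. However, the immediate output of a NeRF is nothing but a colorful density field, devoid of meaning or context, which inhibits building interfaces for interacting with the resulting 3D scenes.
Natural language is an intuitive interface for interacting with a 3D scene. Consider a capture of a kitchen. Imagine being able to navigate this kitchen by asking where the "utensils" are, or more specifically for a tool that could be used for "stirring", or even for your favorite mug with a specific logo on it, all through the comfort and familiarity of everyday conversation. This requires not only the capacity to handle natural language input queries but also the ability to incorporate semantics at multiple scales and to relate to long-tail and abstract concepts.

Solution

A language field
It grounds language within NeRF by optimizing embeddings from an off-the-shelf vision-language model such as CLIP into the 3D scene.
LERF offers an added benefit: because CLIP embeddings are extracted from multiple views at multiple scales, the relevancy maps of text queries obtained from the 3D CLIP embedding are more localized than those obtained from 2D CLIP embeddings. They are also 3D-consistent by definition and can be queried directly in the 3D field without rendering to multiple views.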
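To make the idea of a relevancy map concrete, the sketch below scores a few 3D sample points against a text query by cosine similarity between embeddings. This is a simplification for illustration: the embeddings are made-up 3-vectors rather than CLIP outputs, and `relevancy` is a hypothetical helper, not LERF's own scoring function.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def relevancy(point_embeddings, query_embedding):
    """Return one similarity score per 3D point."""
    return [cosine(e, query_embedding) for e in point_embeddings]


# 3D position -> language embedding stored at that position (toy values).
points = {
    (0.0, 0.0, 0.0): [1.0, 0.0, 0.0],
    (1.0, 0.0, 0.0): [0.0, 1.0, 0.0],
    (0.0, 1.0, 0.0): [0.6, 0.8, 0.0],
}
query = [0.0, 1.0, 0.0]  # stand-in for the embedding of a text query
scores = relevancy(points.values(), query)
best_point = max(zip(scores, points))[1]
```

The scores are computed directly on 3D positions, so no view has to be rendered to find where the query is most relevant.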

Compared with CLIP-Fields [[CLIP-Fields- Weakly Supervised Semantic Fields for Robotic Memory]], LERF is denser.

CLIP-Fields [32] and NLMaps-SayCan [8] fuse CLIP embeddings of crops into pointclouds, using a contrastively supervised field and classical pointcloud fusion respectively. In CLIP-Fields, the crop locations are guided by Detic [40]. On the other hand, NLMaps-SayCan relies on region proposal networks. These maps are sparser than LERF as they primarily query CLIP on detected objects rather than densely throughout views of the scene. Concurrent work ConceptFusion [19] fuses CLIP features more densely in RGBD pointclouds, using Mask2Former [9] to predict regions of interest, meaning it can lose objects which are out of distribution to Mask2Former’s training set. In contrast, LERF does not use region or mask proposals.